Federation & NATS Troubleshooting

This guide covers solutions for common issues when using FraiseQL’s federation and NATS capabilities.

Gateway Troubleshooting

Subgraph unreachable

Symptoms: GatewayError: Failed to fetch SDL from subgraph 'orders' at startup, or circuit breaker opening during operation.

Causes:

Subgraph not running or not reachable from the gateway host
Wrong URL in gateway.toml
Network/firewall issue

Solutions:

# 1. Verify subgraph is reachable from the gateway host
curl http://order-service:4002/health

# 2. Check gateway.toml subgraph URLs
cat gateway.toml | grep url

# 3. Test the GraphQL endpoint directly
curl http://order-service:4002/graphql \
  -H "Content-Type: application/json" \
  -d '{"query":"{ _service { sdl } }"}'

Type ownership conflicts

Symptoms: GatewayError: Type 'User' is defined in multiple subgraphs: users, orders

Causes: The same type name appears in multiple subgraphs. The built-in gateway requires each type to be owned by exactly one subgraph.

Solutions: Ensure each type is defined in only one subgraph. If both services need a User type, have the non-owning service reference it via @key entity resolution instead of redefining it.
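
Before restarting the gateway, it can help to scan the subgraph SDLs for duplicate type definitions. The following is a minimal sketch, not part of FraiseQL: the function name `find_ownership_conflicts` is hypothetical, the regex only handles simple `type Name` definitions, and skipping the root types (`Query`, `Mutation`, `Subscription`) is an assumption about how the gateway merges them.

```python
import re
from collections import defaultdict

# Matches "type Name" at the start of a line. "extend type Name" is skipped
# because such a line starts with "extend", not "type".
TYPE_DEF = re.compile(r"^\s*type\s+(\w+)", re.MULTILINE)

# Assumption: root operation types are merged by the gateway, not owned.
ROOT_TYPES = {"Query", "Mutation", "Subscription"}


def find_ownership_conflicts(sdl_by_subgraph: dict[str, str]) -> dict[str, list[str]]:
    """Return {type_name: [subgraphs]} for types defined in more than one subgraph."""
    owners: dict[str, set[str]] = defaultdict(set)
    for subgraph, sdl in sdl_by_subgraph.items():
        for name in TYPE_DEF.findall(sdl):
            if name not in ROOT_TYPES:
                owners[name].add(subgraph)
    return {name: sorted(subs) for name, subs in owners.items() if len(subs) > 1}


# Illustrative SDL snippets; in practice these would come from each subgraph.
sdls = {
    "users": 'type User @key(fields: "id") { id: ID! name: String }',
    "orders": "type Order { id: ID! }\ntype User { id: ID! }",
}
conflicts = find_ownership_conflicts(sdls)
print(conflicts)
```

Any type reported here needs a single owner; the other subgraphs should reference it through @key entity resolution.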

Entity resolution timeouts

Symptoms: Queries that span multiple subgraphs time out or return partial results.

Solutions:

# Increase circuit breaker recovery window
[gateway.circuit_breaker]
failure_threshold = 10
recovery_timeout_secs = 60

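# Check subgraph latency for _entities queries.
# curl-format.txt is a curl --write-out template, e.g. a file containing: time_total: %{time_total}\n
# _entities returns a union, so fields are selected through an inline fragment.
curl -w "@curl-format.txt" http://order-service:4002/graphql \
  -H "Content-Type: application/json" \
  -d '{"query":"{ _entities(representations: [{__typename: \"Order\", id: \"...\"}]) { ... on Order { id } } }"}'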
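
To make the two circuit breaker settings above concrete, here is a minimal sketch of how a breaker typically uses a failure threshold and a recovery timeout. It illustrates the general pattern only and is not FraiseQL's implementation; the class and method names are hypothetical, and only the parameter names mirror the config keys.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=10, recovery_timeout_secs=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_secs = recovery_timeout_secs
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: allow a trial request only once the recovery window has passed.
        return self._clock() - self._opened_at >= self.recovery_timeout_secs

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the recovery window


# Demo with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout_secs=60, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = not breaker.allow_request()   # circuit is open
now[0] = 61.0
trial_allowed = breaker.allow_request()  # recovery window elapsed
```

Under this model, a higher `failure_threshold` tolerates more slow `_entities` calls before the circuit opens, and `recovery_timeout_secs` sets how long the subgraph is left alone before it is retried.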

Invalid SDL from _service Introspection

Symptoms

ERROR: INVALID_GRAPHQL
  subgraph "my-service" SDL is not valid GraphQL

The SDL contains issues like:

str instead of String, int instead of Int
'Order' | None instead of Order
Duplicate fields in generated WhereInput types

Workaround

Use fraiseql compile --sdl to export the schema as SDL directly from the compiler (bypasses the runtime endpoint):

fraiseql compile schema.py --sdl > subgraph.graphql

Alternatively, use __schema introspection instead of _service { sdl }:

curl localhost:4001/graphql \
  -H "Content-Type: application/json" \
  -d '{"query":"{ __schema { types { name fields { name type { name kind ofType { name } } } } } }"}' \
  | jq . > schema-introspection.json
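
To confirm that an exported SDL is affected by the Python-style leaks listed under Symptoms, a small scan can flag the offending lines. This is a rough sketch with hypothetical names (`find_sdl_leaks`, `LEAKS`); it covers lowercase Python scalars and `| None` only, not duplicate `WhereInput` fields.

```python
import re

# Patterns for two of the leaks described above. GraphQL scalars are
# capitalised (String, Int), so the lowercase Python names stand out.
LEAKS = {
    "python scalar": re.compile(r":\s*\[?\s*(str|int|float|bool)\b"),
    "python optional": re.compile(r"\|\s*None"),
}


def find_sdl_leaks(sdl: str) -> list[tuple[int, str, str]]:
    """Return (line_number, label, line_text) for each suspicious SDL line."""
    findings = []
    for lineno, line in enumerate(sdl.splitlines(), start=1):
        for label, pattern in LEAKS.items():
            if pattern.search(line):
                findings.append((lineno, label, line.strip()))
    return findings


# Illustrative SDL fragment reproducing the symptoms.
sample_sdl = "\n".join([
    "type Order {",
    "  id: ID!",
    "  note: str",
    "  parent: 'Order' | None",
    "  total: Int",
    "}",
])
for finding in find_sdl_leaks(sample_sdl):
    print(finding)
```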

Federation Issues

Connection Issues

“Cannot connect to database X”

Symptoms: DatabaseConnectionError: Failed to connect to database 'inventory'

Causes:

Network connectivity issue
Database server is down
Wrong credentials in config
Firewall blocking connection

Solutions:

# 1. Test database connectivity
curl http://localhost:8080/health

# 2. Check configuration
cat fraiseql.toml | grep -A 5 "\[database\]"

# 3. Verify credentials
echo "$INVENTORY_DATABASE_URL"  # Check URL format

# 4. Test manual connection
psql "$INVENTORY_DATABASE_URL" -c "SELECT 1"

# 5. Enable connection debug logging by setting an environment variable:
# RUST_LOG=debug
#
# Or configure the database in fraiseql.toml:
[database]
url = "${INVENTORY_DATABASE_URL}"
pool_max = 10

“Pool exhausted - no available connections”

Symptoms: PoolExhaustedError: No available connections in pool for database 'inventory'

Causes:

Pool size too small for load
Connections leaking (not being released)
Long-running queries blocking other requests
Deadlock in pool acquisition

Solutions:

# 1. Increase pool size
[database]
pool_min = 2
pool_max = 50  # Increased from default 20
connect_timeout_ms = 30000  # Timeout waiting for connection
idle_timeout_ms = 3600000  # Close idle connections after 1 hour

# 2. Monitor connection pool via the metrics endpoint:
# GET /metrics → fraiseql_database_pool_active, fraiseql_database_pool_idle

# 3. Identify slow queries — set slow query threshold in fraiseql.toml:
# [database]
# log_slow_queries_ms = 1000  # Log queries slower than 1 second
# Or use RUST_LOG=debug for full query logging.
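
The pool metrics from step 2 can be turned into a utilization figure for alerting. The sketch below assumes the two metrics are exposed as unlabelled gauges in Prometheus text format; that layout, the sample values, and the helper names are assumptions, not documented FraiseQL behavior.

```python
def parse_gauges(metrics_text: str) -> dict[str, float]:
    """Parse unlabelled 'name value' lines from Prometheus text exposition format."""
    gauges: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and "{" not in parts[0]:
            try:
                gauges[parts[0]] = float(parts[1])
            except ValueError:
                continue  # ignore lines whose value is not numeric
    return gauges


def pool_utilization(metrics_text: str, pool_max: int) -> float:
    """Fraction of pool_max currently checked out (0.0 to 1.0)."""
    active = parse_gauges(metrics_text).get("fraiseql_database_pool_active", 0.0)
    return active / pool_max


# Illustrative scrape output; real values come from GET /metrics.
sample = """\
# TYPE fraiseql_database_pool_active gauge
fraiseql_database_pool_active 45
# TYPE fraiseql_database_pool_idle gauge
fraiseql_database_pool_idle 5
"""
utilization = pool_utilization(sample, pool_max=50)
print(f"pool utilization: {utilization:.0%}")
```

A utilization that stays near 1.0 under normal load points to an undersized pool; spikes that never recover are more consistent with leaked connections.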

Query Issues

“Timeout in federated query”

Symptoms: FederationTimeoutError: Federated query to 'inventory' timed out after 5000ms

Causes:

Network latency between databases
Query on remote database is slow
Default timeout too aggressive

Solutions:

# 1. Increase federation timeout
[federation]
default_timeout = 10000  # Increased from 5000ms
batch_size = 50  # Smaller batches = faster queries

# 2. Per-database timeout
[federation.database_timeouts]
inventory = 10000
payments = 15000  # Slower database needs more time

# 3. Configure federation batch size and timeout in fraiseql.toml:
[federation]
batch_size = 100
default_timeout = 10000

# The Python type declares the shape only; there is no database routing in the decorator:
@fraiseql.type
class Order:
    id: ID
    items: list[OrderItem]  # Federated from inventory DB (see fraiseql.toml)

-- 4. Add database indexes on the inventory database:
CREATE INDEX idx_tb_order_item_fk_order ON tb_order_item(fk_order);
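
To see what `batch_size` trades off, the sketch below splits a list of parent IDs into batches and builds one `IN (...)` lookup per batch. It is an illustration under assumptions: the helper names are hypothetical, the view name `v_order_item` is borrowed from elsewhere in this guide, and FraiseQL's runtime may batch differently.

```python
from typing import Iterator, Sequence


def chunked(ids: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def build_batch_queries(order_ids: Sequence[str], batch_size: int = 50):
    """Return one parameterized (sql, params) pair per batch of order IDs."""
    queries = []
    for batch in chunked(order_ids, batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = f"SELECT id, data FROM v_order_item WHERE fk_order IN ({placeholders})"
        queries.append((sql, batch))
    return queries


order_ids = [f"order-{i}" for i in range(120)]
queries = build_batch_queries(order_ids, batch_size=50)
print([len(params) for _, params in queries])
```

Smaller batches keep each remote query short enough to finish inside the timeout, at the cost of more round trips.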

“Circular reference in federation”

Symptoms: CircularReferenceError when loading schema or executing deeply nested queries.

Causes:

Bidirectional federated references
Deeply nested federated queries

Solutions:

# BAD: Circular reference — avoid federating back to the originating type
@fraiseql.type
class Order:
    items: list[OrderItem]  # Federated from inventory DB

@fraiseql.type
class OrderItem:
    order: Order  # Circular! OrderItem federates back to Order

# GOOD: Only federate in one direction
@fraiseql.type
class Order:
    items: list[OrderItem]  # Federated from inventory DB (one way only)

@fraiseql.type
class OrderItem:
    order_id: ID  # Just store the ID — don't federate back to Order
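
Bidirectional references like the BAD example can be found mechanically by treating federated fields as edges in a graph and looking for a cycle. This is a generic depth-first search sketch; the `find_cycle` helper and the edge dictionaries are hypothetical and would have to be built from your own schema.

```python
from typing import Optional


def find_cycle(edges: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one cycle as a list of type names, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in edges.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in edges:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Federated fields from the BAD and GOOD examples above.
bad = {"Order": ["OrderItem"], "OrderItem": ["Order"]}
good = {"Order": ["OrderItem"], "OrderItem": []}
print(find_cycle(bad), find_cycle(good))
```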

“Inconsistent foreign keys across databases”

Symptoms: ForeignKeyError: Order references Product ID that doesn't exist

Causes:

Data deleted in one database but not cascaded
Race condition between databases
Data inconsistency

Solutions:

# The Python decorator declares the mutation's shape only; there is no runtime DB access in Python
@fraiseql.mutation(sql_source="fn_add_item_to_order", operation="CREATE")
def add_item_to_order(order_id: ID, product_id: ID) -> OrderItem:
    """Add item to order. Validation is handled in the SQL function."""
    pass

-- Validation belongs in the SQL function (fn_add_item_to_order).
-- The function can return a mutation_response with status 'failed:not_found'
-- if the product doesn't exist:
--
-- CREATE FUNCTION fn_add_item_to_order(p_order_id UUID, p_product_id UUID)
-- RETURNS mutation_response AS $$
-- DECLARE
--   v_product_exists BOOLEAN;
--   v_result mutation_response;
-- BEGIN
--   SELECT EXISTS(SELECT 1 FROM tb_product WHERE id = p_product_id)
--     INTO v_product_exists;
--   IF NOT v_product_exists THEN
--     v_result.status := 'failed:not_found';
--     v_result.message := 'Product not found';
--     RETURN v_result;
--   END IF;
--   -- ... insert logic
-- END;
-- $$ LANGUAGE plpgsql;

-- Enforce referential integrity at the database level
-- (foreign keys apply only when both tables live in the same database):
ALTER TABLE tb_order_item
    ADD CONSTRAINT fk_tb_order_item_fk_order
        FOREIGN KEY (fk_order) REFERENCES tb_order(pk_order),
    ADD CONSTRAINT fk_tb_order_item_fk_product
        FOREIGN KEY (fk_product) REFERENCES tb_product(pk_product);

Saga/Transaction Issues

“Saga compensation failed”

Symptoms: SagaCompensationError: Compensation step 'reserve_inventory' failed

Causes:

Compensation function has bugs
Database state changed unexpectedly
Compensation takes too long (timeout)

Solutions:

-- Saga compensation is implemented as SQL functions.
-- The fn_ function should be idempotent and return a mutation_response.
-- Example: make fn_release_reservation idempotent:
CREATE OR REPLACE FUNCTION fn_release_reservation(p_id UUID)
RETURNS mutation_response AS $$
DECLARE v_result mutation_response;
BEGIN
    -- Idempotent: only update if not already released
    UPDATE tb_reservation
       SET status = 'released'
     WHERE id = p_id AND status != 'released';

    v_result.status  := 'success';
    v_result.message := 'Reservation released';
    RETURN v_result;
END;
$$ LANGUAGE plpgsql;

# Configure saga/observer timeout in fraiseql.toml:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"
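
The idempotency guard in `fn_release_reservation` can be checked in isolation. The sketch below reproduces the same `WHERE ... status != 'released'` condition against an in-memory SQLite table; SQLite and the Python wrapper are stand-ins chosen so the example runs anywhere, not how FraiseQL executes compensation.

```python
import sqlite3


def release_reservation(conn: sqlite3.Connection, reservation_id: str) -> int:
    """Release a reservation; returns rows changed (0 means already released)."""
    cur = conn.execute(
        "UPDATE tb_reservation SET status = 'released' "
        "WHERE id = ? AND status != 'released'",
        (reservation_id,),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_reservation (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO tb_reservation VALUES ('r1', 'reserved')")
conn.commit()

first = release_reservation(conn, "r1")   # performs the release
second = release_reservation(conn, "r1")  # retry is a harmless no-op
status = conn.execute("SELECT status FROM tb_reservation WHERE id = 'r1'").fetchone()[0]
print(first, second, status)
```

Because a retried compensation changes nothing the second time, the saga can safely re-run the step after a timeout.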

“Saga stuck in pending state”

Symptoms: Order created but saga never completes; stuck in pending status.

Causes:

One saga step is hanging
Network issue between databases
Database deadlock

Solutions:

# 1. Monitor saga/mutation progress via structured logs.
# Set RUST_LOG=debug to see each request with its requestId.
# Use pg_notify observers in fraiseql.toml to track step completions:
# [observers]
# backend = "nats"
# nats_url = "nats://localhost:4222"

-- 2. Check stuck sagas in database
SELECT *
FROM tb_saga_execution
WHERE status = 'pending'
  AND created_at < NOW() - INTERVAL '5 minutes'
ORDER BY created_at DESC;

-- 3. Manual cleanup of stuck sagas (as a SQL function called via mutation)
-- Define a fn_cleanup_stuck_saga SQL function that:
-- - Validates the saga is in pending state
-- - Marks it as compensated or triggers compensation steps
-- - Returns a mutation_response with the outcome

# Expose the cleanup as a FraiseQL mutation (compile-time definition only):
@fraiseql.mutation(sql_source="fn_cleanup_stuck_saga", operation="CUSTOM")
def cleanup_stuck_saga(saga_id: ID) -> bool:
    """Manually trigger compensation for stuck saga."""
    pass

Performance Issues

“Federated queries are slow”

Symptoms: Query with federated field takes 10+ seconds

Causes:

Network latency
Missing indexes
N+1 query pattern (one federated lookup per parent row instead of one batched lookup)
Query hitting large tables

Solutions:

# 1. Check if federation is batching correctly
@fraiseql.query
def orders_with_items(limit: int = 100) -> list[Order]:
    """
    With batching: Should be 2 queries total
    - 1 query: SELECT id, data FROM v_order LIMIT 100
    - 1 query: SELECT id, data FROM v_order_item WHERE fk_order IN (...)
    """
    return fraiseql.config(sql_source="v_order")

# 2. Denormalize to reduce federated queries
@fraiseql.type
class Order:
    id: ID
    item_count: int  # Denormalized count from SQL view — avoids federation

    # Federated from inventory DB — configured in fraiseql.toml
    items: list[OrderItem]

# 3. Use selective queries to avoid full federation
@fraiseql.query
def order_summary(id: ID) -> OrderSummary | None:
    """
    Query a summary view instead of the full federated type
    when only aggregate fields are needed.
    """
    return fraiseql.config(sql_source="v_order_summary")

# Enable query logging to verify batching — set via environment variable:
# RUST_LOG=debug fraiseql run

-- 4. Add indexes on the inventory database:
CREATE INDEX idx_tb_order_item_fk_order_fk_product
    ON tb_order_item(fk_order, fk_product);

NATS Issues

Connection Issues

“NATS connection refused”

Symptoms: NatsConnectionError: Failed to connect to NATS server

Causes:

NATS server not running
Wrong URL/port
Firewall blocking

Solutions:

# 1. Check NATS server status
nats server info

# 2. Test connection (measures round-trip time to the server)
nats rtt

# 3. Check configuration
cat fraiseql.toml | grep -A 3 "\[nats\]"

# 4. Verify URL format
echo $NATS_URL  # Should be: nats://host:4222

# 5. Start NATS if not running
docker run -it --rm -p 4222:4222 nats

“NATS authentication failed”

Symptoms: AuthorizationError: NATS authentication failed

Causes:

Wrong token/credentials
Expired credentials
Insufficient permissions

Solutions:

# 1. Update credentials
[nats.auth]
type = "token"
token = "${NATS_TOKEN}"  # Ensure env var is set

# 2. Verify token
echo $NATS_TOKEN

# 3. Use NKey authentication (more secure)
[nats.auth]
type = "nkey"
nkey = "${NATS_NKEY}"

# 4. Generate new credentials
nats user create fraiseql-user
nats nkey gen user -o fraiseql.nk  # NKey

JetStream Issues

“JetStream stream not found”

Symptoms: StreamNotFoundError: Stream 'orders' not found

Causes:

Stream not created
Stream name mismatch
Configuration issue

Solutions:

# 1. List existing streams
nats stream list

# 2. Check stream configuration
nats stream info orders

# 3. Create missing stream
nats stream add orders \
    --subjects "fraiseql.order.>" \
    --max-msgs 1000000 \
    --max-bytes 10GB \
    --retention limits

# 4. Ensure stream is configured
[nats.jetstream.streams.orders]
subjects = ["fraiseql.order.>"]
replicas = 3
max_msgs = 1000000
max_bytes = 10737418240

“Consumer lag is high”

Symptoms: Consumer far behind in processing; queue backs up

Causes:

Consumer processing is slow
Consumer crashed/restarted
Too few consumer instances

Solutions:

# 1. Check consumer status
nats consumer info orders order-processor

# Output shows:
# Pending: 50000  # Many messages waiting
# Delivered: 1000
# Acked: 800

# 2. Increase processing capacity
# Scale up consumer service: 1 instance -> 3 instances

# 3. Check consumer queue group
nats consumer info orders order-processor

# 4. Increase ack wait if processing is slow
[nats.jetstream.consumers.order-processor]
ack_wait = "60s"  # Increased from 30s

# 5. Configure NATS observers in fraiseql.toml.
# FraiseQL's Rust runtime handles NATS event dispatch — not Python.
# Use the [observers] section to configure subjects and topics:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"

“Messages not being delivered”

Symptoms: Event published but subscribers don’t receive it

Causes:

Subscriber not running
Subject mismatch
Consumer has reached its limit of unacknowledged messages (max ack pending)

Solutions:

# 1. Check consumer status
nats consumer info orders order-processor

# Look for:
# - NumPending (messages waiting)
# - NumAckPending (unacked messages)

# 2. Check subject matches
# Published to: fraiseql.order.created
# Subscribed to: fraiseql.order.>  (match)
# Subscribed to: orders.created    (no match)

# 3. Verify FraiseQL is running and observers are configured in fraiseql.toml.
# RUST_LOG=debug will log received NATS events.
# Look for "nats: received message on subject fraiseql.order.created" in the output.

# 4. Check max deliver limit
# If message is redelivered more than max_deliver times,
# it goes to dead letter queue
[nats.jetstream.consumers.order-processor]
max_deliver = 3  # Redelivered max 3 times

# Check dead letter queue via NATS CLI:
nats consumer info orders fraiseql-dlq
nats stream view fraiseql-dlq
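
The subject check in step 2 follows NATS wildcard rules: `*` matches exactly one token and `>` matches one or more trailing tokens. The matcher below is a small sketch for verifying a publish subject against a subscription pattern offline; `subject_matches` is a hypothetical helper, not part of any NATS client.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subject matches a subscription pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(p_tokens):
        if tok == ">":
            # '>' must be the last pattern token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # pattern is longer than the subject
        if tok != "*" and tok != s_tokens[i]:
            return False  # literal token mismatch
    return len(p_tokens) == len(s_tokens)


published = "fraiseql.order.created"
for pattern in ("fraiseql.order.>", "fraiseql.*.created", "orders.created"):
    print(pattern, subject_matches(pattern, published))
```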

Event Processing Issues

“Events are processed out of order”

Symptoms: Status-changed events arrive before the creation event

Causes:

Multiple consumer instances processing same events
Network reordering
Consumer group distributing across instances

Solutions:

# 1. Configure NATS partitioning in fraiseql.toml so the same order ID
# always routes to the same partition, preserving ordering:
[nats.partitions]
enabled = true
key = "order_id"  # Same order always goes to same partition
count = 8

# 2. Use a single-consumer durable subscriber (no queue group)
# so events are processed strictly in sequence per subject:
nats consumer add orders order-seq \
    --deliver all \
    --ack explicit \
    --wait 30s \
    --max-deliver 3

# 3. Use durable consumer with explicit ack
[nats.jetstream.consumers.order-processor]
deliver_policy = "all"  # Start from beginning
ack_policy = "explicit"  # Must explicitly ACK
ack_wait = "30s"  # Timeout if not ACKed
max_deliver = 3  # Retry 3 times
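
Key-based partitioning preserves per-order ordering because every event for a given `order_id` lands in the same partition. The sketch below shows the idea with a stable hash; the hash function FraiseQL actually uses is not documented here, so `partition_for` is an illustrative assumption.

```python
import hashlib


def partition_for(key: str, count: int = 8) -> int:
    """Map a key to a partition in [0, count) using a process-independent hash."""
    # Python's built-in hash() is salted per process, so a stable digest is used.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % count


events = [
    ("order-42", "created"),
    ("order-7", "created"),
    ("order-42", "status_changed"),
]
for order_id, kind in events:
    print(order_id, kind, "-> partition", partition_for(order_id))
```

Events for different orders may still interleave across partitions; only events sharing a key keep their relative order.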

“Duplicate event processing”

Symptoms: Same event processed multiple times; duplicate orders created

Causes:

No idempotency checks
At-least-once delivery semantics
Retry without deduplication

Solutions:

-- 1. Implement idempotency at the database level.
-- The SQL function that processes the event should be idempotent.
-- Use INSERT ... ON CONFLICT DO NOTHING with a unique event_id column:
CREATE TABLE tb_processed_event (
    pk_processed_event BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    id                 UUID   DEFAULT gen_random_uuid() UNIQUE NOT NULL,
    identifier         TEXT   UNIQUE NOT NULL,  -- event_id
    processed_at       TIMESTAMPTZ DEFAULT now()
);

-- In the processing function:
-- INSERT INTO tb_processed_event (identifier) VALUES (p_event_id)
-- ON CONFLICT (identifier) DO NOTHING;
-- IF NOT FOUND THEN RETURN; END IF;  -- Already processed

# 2. Use NATS JetStream's built-in message deduplication.
# Set a deduplication window when creating the stream:
nats stream add orders \
    --subjects "fraiseql.order.>" \
    --dupe-window 24h \
    --max-msgs 1000000

# Publishers include a Nats-Msg-Id header for deduplication:
nats pub fraiseql.order.created '{"order_id":"..."}' \
    --header Nats-Msg-Id:evt_550e8400
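
The processed-event table from step 1 can be exercised end to end with SQLite, which ships with Python. This is a sketch under assumptions: `INSERT OR IGNORE` stands in for PostgreSQL's `ON CONFLICT DO NOTHING`, the table is reduced to its `identifier` column, and `process_once` is a hypothetical helper.

```python
import sqlite3


def process_once(conn: sqlite3.Connection, event_id: str, handler) -> bool:
    """Run handler at most once per event_id; returns False for duplicates."""
    with conn:  # one transaction: the marker row and side effects commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO tb_processed_event (identifier) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # marker already present: event was processed before
        handler(conn)
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_processed_event (identifier TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE tb_order (id INTEGER PRIMARY KEY, source_event TEXT)")


def create_order(c: sqlite3.Connection) -> None:
    c.execute("INSERT INTO tb_order (source_event) VALUES ('evt_550e8400')")


# Deliver the same event twice, as at-least-once delivery may do.
results = [process_once(conn, "evt_550e8400", create_order) for _ in range(2)]
order_count = conn.execute("SELECT COUNT(*) FROM tb_order").fetchone()[0]
print(results, order_count)
```

Writing the marker and the side effect in the same transaction matters: if the handler fails, the marker rolls back and the redelivered event is processed normally.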

Federation + NATS Issues

“Saga completes but event never publishes”

Symptoms: Order created successfully but notification service doesn’t receive event

Causes:

Event publish happens after the saga completes, outside its transaction, so a failure at that point loses the event
NATS publish fails silently
Network partition after the federated write but before the NATS publish

Solutions:

-- 1. Use the transactional outbox pattern for guaranteed delivery.
-- Write events to a tb_pending_event table inside the same transaction
-- as the mutation, then a separate process (or pg_notify) publishes to NATS:
CREATE TABLE tb_pending_event (
    pk_pending_event BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    id               UUID   DEFAULT gen_random_uuid() UNIQUE NOT NULL,
    identifier       TEXT   UNIQUE NOT NULL,  -- idempotency key
    subject          TEXT   NOT NULL,         -- NATS subject
    payload          JSONB  NOT NULL,
    created_at       TIMESTAMPTZ DEFAULT now(),
    published_at     TIMESTAMPTZ
);

-- Inside fn_create_order, after inserting the order, also insert the event:
-- INSERT INTO tb_pending_event (identifier, subject, payload)
-- VALUES (
--   'order.created.' || v_order_id,
--   'fraiseql.order.created',
--   jsonb_build_object('order_id', v_order_id, ...)
-- );

# 2. Configure FraiseQL observers to publish tb_pending_event rows to NATS.
# The Rust runtime handles polling/pg_notify and publishing:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"

# 3. Monitor NATS publish errors via the FraiseQL metrics endpoint:
curl http://localhost:8080/metrics | grep nats_publish
# fraiseql_nats_publish_total{status="success"} 1234
# fraiseql_nats_publish_total{status="error"} 0
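
The outbox flow from step 1 can be sketched in a few lines. In this illustration SQLite replaces PostgreSQL, the table is trimmed to the columns the flow needs, and `publish` is a hypothetical callable standing in for the NATS publisher; in FraiseQL the Rust runtime performs this role.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE tb_order (id TEXT PRIMARY KEY);
CREATE TABLE tb_pending_event (
    identifier   TEXT PRIMARY KEY,  -- idempotency key
    subject      TEXT NOT NULL,     -- NATS subject
    payload      TEXT NOT NULL,
    published_at TEXT
);
"""


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Insert the order and its event row in a single transaction."""
    with conn:
        conn.execute("INSERT INTO tb_order (id) VALUES (?)", (order_id,))
        conn.execute(
            "INSERT INTO tb_pending_event (identifier, subject, payload) VALUES (?, ?, ?)",
            (f"order.created.{order_id}", "fraiseql.order.created",
             json.dumps({"order_id": order_id})),
        )


def publish_pending(conn: sqlite3.Connection, publish) -> int:
    """Publish unpublished rows in insertion order; returns how many were sent."""
    rows = conn.execute(
        "SELECT identifier, subject, payload FROM tb_pending_event "
        "WHERE published_at IS NULL ORDER BY rowid"
    ).fetchall()
    sent = 0
    for identifier, subject, payload in rows:
        publish(subject, payload, identifier)  # if this raises, the row stays pending
        with conn:
            conn.execute(
                "UPDATE tb_pending_event SET published_at = datetime('now') "
                "WHERE identifier = ?",
                (identifier,),
            )
        sent += 1
    return sent


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
create_order(conn, "o-1")

outbox_log = []
first_run = publish_pending(conn, lambda s, p, i: outbox_log.append((s, i)))
second_run = publish_pending(conn, lambda s, p, i: outbox_log.append((s, i)))
print(first_run, second_run, outbox_log)
```

A crash between publishing and marking the row leads to a second publish on the next run, which is why the `identifier` should also be sent as the `Nats-Msg-Id` so JetStream can drop the duplicate.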

“Race condition: Event arrives before federation completes”

Symptoms: Notification service processes event but queries return empty

Causes:

Event subscriber queries federation before saga completes
Event published before database transaction commits
Clock skew or timing issue

Solutions:

-- 1. Publish only after transaction commits.
-- Use the transactional outbox pattern (see above): insert the event row
-- in the same transaction as the mutation. The outbox publisher only
-- dispatches to NATS after the PostgreSQL transaction is durably committed.

# 2. Include all necessary data in the event payload so downstream services
# do not need to query the API before the data is replicated.
# Publish from the SQL function via the outbox, including a full snapshot:
#
# INSERT INTO tb_pending_event (identifier, subject, payload)
# VALUES (
#   'order.confirmed.' || v_order_id,
#   'fraiseql.order.confirmed',
#   jsonb_build_object(
#     'order_id',     v_order_id,
#     'customer_id',  v_customer_id,
#     'total',        v_total::text,
#     'items',        v_items_json   -- full snapshot, no follow-up query needed
#   )
# );

# 3. Configure retry/backoff for the NATS observer in fraiseql.toml:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"

Database-Specific Gotchas

“Deadlock between federation and saga”

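-- Problem: Saga holds lock while federation waits
-- Solution: Use a lower isolation level. READ COMMITTED is PostgreSQL's default,
-- so this only helps if the session was raised to REPEATABLE READ or SERIALIZABLE.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

“Federated queries with large result sets OOM”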

# Use cursor-based (Relay) pagination to avoid loading large result sets at once.
# Define the query with relay=True in fraiseql.config():
@fraiseql.query
def orders(limit: int = 100) -> list[Order]:
    """Use cursor pagination — configured in fraiseql.toml federation section."""
    return fraiseql.config(sql_source="v_order", relay=True)

# Then query with cursor-based pagination from the client:
query {
  orders(first: 100, after: "cursor-from-previous-page") {
    edges { node { id } cursor }
    pageInfo { hasNextPage endCursor }
  }
}
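
On the client side, cursor pagination becomes a loop that follows `endCursor` until `hasNextPage` is false, so only one page is held in memory at a time. The sketch below assumes a hypothetical `fetch_page(first, after)` callable that returns the `orders` connection as a dictionary; the in-memory fake replaces the real GraphQL request.

```python
from typing import Callable, Iterator, Optional


def iter_orders(
    fetch_page: Callable[[int, Optional[str]], dict],
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield order nodes page by page, following Relay-style cursors."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(page_size, cursor)
        for edge in page["edges"]:
            yield edge["node"]
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            return
        cursor = info["endCursor"]


# Fake backend for the demo: 250 orders, cursor is the index of the last item.
ORDERS = [{"id": str(i)} for i in range(250)]
calls = []


def fake_fetch_page(first: int, after: Optional[str]) -> dict:
    calls.append(after)
    start = 0 if after is None else int(after) + 1
    chunk = ORDERS[start:start + first]
    edges = [{"node": order, "cursor": order["id"]} for order in chunk]
    return {
        "edges": edges,
        "pageInfo": {
            "hasNextPage": start + len(chunk) < len(ORDERS),
            "endCursor": edges[-1]["cursor"] if edges else None,
        },
    }


ids = [order["id"] for order in iter_orders(fake_fetch_page, page_size=100)]
print(len(ids), calls)
```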

“SQLite locking issues with federation”

# SQLite has a single writer. Configure a smaller pool size in fraiseql.toml
# to reduce write contention:
[database]
pool_max = 1        # SQLite: single writer
pool_min = 1

# Enable WAL mode on the SQLite database for better read concurrency:
sqlite3 database.db "PRAGMA journal_mode=WAL;"

Monitoring and Debugging Checklist

Federation Reference

Complete reference documentation for FraiseQL’s federation capabilities. Federation Guide

NATS Reference

Reference documentation for NATS integration and JetStream configuration. NATS Guide

Error Handling

Patterns for handling errors in federated and event-driven applications. Error Handling Guide

General Troubleshooting

Diagnose connection, query, and infrastructure issues. Troubleshooting Index