
Analytics & Arrow Flight DataPlane

FraiseQL exposes an Apache Arrow Flight DataPlane alongside the standard GraphQL endpoint, enabling analytics clients to stream columnar data at wire speed — bypassing row-by-row serialization for aggregate and reporting queries that return millions of rows.

The Arrow Flight DataPlane is ideal for dashboards, BI tools, data pipelines, and any workload where throughput matters more than the flexibility of a GraphQL query.

Before enabling the Arrow Flight DataPlane:

  • PostgreSQL 14 or later
  • FraiseQL running with a PostgreSQL DATABASE_URL
  • Client library for your language:
    • Python: pyarrow (pip install pyarrow)
    • TypeScript/Node: apache-arrow (npm install apache-arrow)
    • Rust: arrow and arrow-flight crates

Enable the DataPlane in fraiseql.toml:

[analytics]
enabled = true

The Arrow Flight service starts on port 50051 by default. This port is in addition to the standard GraphQL port (8080).

The only configuration key for the [analytics] section is enabled. Port, batch size, and compression are not configurable via TOML — they are compile-time constants set in the fraiseql-arrow crate.

  1. Enable the DataPlane

    Add the [analytics] block to fraiseql.toml:

    [analytics]
    enabled = true
  2. Start FraiseQL

    fraiseql run

    You should see a log line confirming the Flight service has started:

    INFO fraiseql::flight: Arrow Flight DataPlane listening on 0.0.0.0:50051
  3. Connect a client

    Point your Arrow Flight client at grpc://localhost:50051. Authentication uses the same JWT/API key tokens as the GraphQL endpoint — pass the token as a gRPC authorization header (Bearer eyJ...). TLS follows the server-level TLS configuration.
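
The authorization metadata from step 3 can be sketched in a few lines (stdlib-only; the token value is a placeholder, use the same JWT/API key the GraphQL endpoint accepts):

```python
# Build the gRPC authorization header sent with every Flight call.
# "<your-jwt-token>" is a placeholder, not a real credential.
token = "<your-jwt-token>"
headers = [(b"authorization", f"Bearer {token}".encode())]

# In pyarrow this list is passed as flight.FlightCallOptions(headers=headers)
print(headers[0][1])  # → b'Bearer <your-jwt-token>'
```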

The Flight DataPlane reads any v_* view exposed by FraiseQL. Define your aggregation logic in SQL and register the query in your schema.

  1. Create the SQL view

    CREATE VIEW va_analytics_sales AS
    SELECT
        product_id,
        SUM(revenue) AS revenue,
        COUNT(*) AS orders,
        ROUND(AVG(revenue), 2) AS avg_order_value
    FROM tf_sales
    GROUP BY product_id
    ORDER BY revenue DESC;
  2. Register in your schema

    import fraiseql
    from dataclasses import dataclass
    from decimal import Decimal

    @fraiseql.type
    class SalesAggregate:
        """Aggregated sales metrics by product."""
        product_id: str
        revenue: Decimal
        orders: int
        avg_order_value: Decimal

    @fraiseql.query
    def analytics_sales() -> list[SalesAggregate]:
        """Sales aggregates — available via both GraphQL and Arrow Flight."""
        return fraiseql.config(sql_source="va_analytics_sales")
  3. Query via Flight

    Any view registered with @fraiseql.query or @Query() is automatically accessible through both the GraphQL endpoint and the Arrow Flight DataPlane.

import pyarrow.flight as flight
import json
# Connect to the Flight service
client = flight.connect("grpc://localhost:50051")
# Authenticate with Bearer token (same JWT as GraphQL endpoint)
options = flight.FlightCallOptions(headers=[(b"authorization", b"Bearer <your-jwt-token>")])
# Create a ticket using the registered query name
ticket = flight.Ticket(json.dumps({"query": "analytics_sales"}).encode())
# Read the data
reader = client.do_get(ticket, options)
table = reader.read_all()
# Convert to pandas for analysis
print(table.to_pandas())

Expected output:

  product_id  revenue  orders  avg_order_value
0   prod_001  45231.5     312            145.0
1   prod_002  38920.0     287            135.6
2   prod_003  29118.0     201            144.9

The returned object is a PyArrow Table. From there you can:

  • Convert to pandas: table.to_pandas()
  • Convert to polars: pl.from_arrow(table)
  • Write to Parquet: pq.write_table(table, 'output.parquet')
  • Stream to DuckDB: Direct Arrow integration
The same query from a TypeScript client:

import { tableFromIPC } from 'apache-arrow';
import { FlightClient } from '@apache-arrow/flight';

const client = new FlightClient({ host: 'localhost', port: 50051 });
const ticket = Buffer.from(JSON.stringify({ query: 'analytics_sales' }));

const batches = [];
for await (const batch of client.doGet({ ticket })) {
  batches.push(batch);
}

const table = tableFromIPC(batches);
console.table(table.toArray().slice(0, 5));

Pass filter parameters in the query JSON to push predicates down to PostgreSQL:

import json
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:50051")

command = json.dumps({
    "query": "analytics_sales",
    "variables": {
        "start_date": "2024-01-01",
        "end_date": "2024-03-31",
        "region": "US",
    },
}).encode()

info = client.get_flight_info(flight.FlightDescriptor.for_command(command))
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()

Arrow Flight sends data in Apache Arrow’s columnar binary format with optional LZ4 or Zstd compression. GraphQL sends JSON. For analytical payloads, the difference is substantial:

| Workload                | GraphQL (JSON) | Arrow Flight (LZ4) | Speedup |
|-------------------------|----------------|--------------------|---------|
| 100k rows, 6 columns    | ~2,800 ms      | ~180 ms            | ~15x    |
| 1M rows, 6 columns      | ~28,000 ms     | ~1,400 ms          | ~20x    |
| 10M rows, 6 columns     | Timeout        | ~12,000 ms         |         |
| Pandas load (1M rows)   | ~9,200 ms      | ~420 ms            | ~22x    |

Choosing between the two transports:

| Use Case                             | Recommendation |
|--------------------------------------|----------------|
| Transactional reads                  | GraphQL — ACID semantics, mutations, subscriptions |
| Real-time updates                    | GraphQL — Subscriptions over WebSocket |
| Small result sets (less than 10k rows) | GraphQL — Simpler client libraries, HTTP caching |
| Large aggregations                   | Arrow Flight — Columnar format, compression |
| Data pipelines                       | Arrow Flight — Stream processing, Parquet export |
| BI tool integration                  | Arrow Flight — Standard Arrow ecosystem |

Arrow Flight is pull-based: the client iterates over record batches returned by the server. For large result sets, read batches one at a time rather than loading everything into memory at once:

import pyarrow.flight as flight
import json

client = flight.connect("grpc://localhost:50051")
options = flight.FlightCallOptions(headers=[(b"authorization", b"Bearer <your-jwt-token>")])
ticket = flight.Ticket(json.dumps({"query": "analytics_sales"}).encode())
reader = client.do_get(ticket, options)

# Process one batch at a time — avoids loading all rows into memory
for chunk in reader:
    df = chunk.data.to_pandas()
    print(f"[{chunk.data.num_rows} rows] revenue sum: {df['revenue'].sum()}")
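
The same loop structure supports running aggregates. A stdlib-only sketch, with plain dicts standing in for the Arrow record batches yielded by the reader:

```python
# Per-batch accumulation: each batch is processed and discarded, so the
# full result set is never held in memory at once. The dicts below stand
# in for record batches; the values are illustrative.
batches = [
    {"revenue": [45231.5, 38920.0]},
    {"revenue": [29118.0]},
]

total = 0.0
for batch in batches:
    total += sum(batch["revenue"])  # per-batch work only

print(round(total, 1))  # → 113269.5
```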

For real-time counters and live dashboards, use GraphQL subscriptions over WebSocket instead — Arrow Flight is not push-based.

TLS for the Arrow Flight service is controlled by the server-level TLS configuration (not a separate [analytics] subsection). When the server TLS is enabled, the Flight service uses the same certificates.

Client connection with TLS:

import pyarrow.flight as flight
# TLS enabled — use grpcs scheme
client = flight.connect("grpcs://analytics.example.com:50051")

The Arrow Flight service exposes Prometheus metrics:

| Metric                              | Description |
|-------------------------------------|-------------|
| fraiseql_arrow_requests_total       | Total Arrow Flight requests |
| fraiseql_arrow_rows_streamed_total  | Total rows streamed |
| fraiseql_arrow_query_duration       | Query execution time |
| fraiseql_arrow_bytes_sent           | Bytes transferred |

Health check endpoint:

curl http://localhost:8080/health

Includes Flight service status:

{
  "status": "ok",
  "analytics": {
    "enabled": true,
    "port": 50051,
    "connections": 12
  }
}
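
A deploy script might parse this payload as a readiness check. A minimal sketch using the sample response above:

```python
import json

# The JSON literal mirrors the sample health response above.
payload = '{"status": "ok", "analytics": {"enabled": true, "port": 50051, "connections": 12}}'
health = json.loads(payload)

assert health["status"] == "ok"
assert health["analytics"]["enabled"], "Flight DataPlane disabled"
print(health["analytics"]["port"])  # → 50051
```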

Arrow Flight is a separate analytics transport distinct from the API transports (GraphQL, REST, gRPC). It is designed for bulk data transfer, not request-response API patterns.

Current Arrow Flight implementation has these limitations:

  1. PostgreSQL only: MySQL, SQLite, and SQL Server are not supported
  2. No Flight SQL: Uses custom JSON descriptors, not standard Flight SQL protocol
  3. Read-only: Mutations must use the GraphQL endpoint
  4. Pull-based only: Streaming is pull-based — the client reads batches from the server; there is no server-push streaming mode

“Connection refused” on port 50051

  1. Verify the DataPlane is enabled in fraiseql.toml:

    [analytics]
    enabled = true
  2. Check FraiseQL logs for the startup message:

    INFO fraiseql::flight: Arrow Flight DataPlane listening on 0.0.0.0:50051
  3. Confirm the port is not blocked by a firewall
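
To verify step 3 from the client machine, a plain TCP probe distinguishes a listening port from a refused or filtered one (stdlib-only; assumes a local deployment):

```python
import socket

# Probe the Flight port; connect_ex returns 0 when a listener accepted
# the connection, a nonzero errno otherwise.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
status = sock.connect_ex(("localhost", 50051))
sock.close()

print("port open" if status == 0 else "connection refused or filtered")
```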

“View not found” errors

Ensure the query name in your Flight ticket matches a query registered with @fraiseql.query, and that the view named by its sql_source exists in the database. Both names are case-sensitive.

Slow query performance

Arrow Flight is only as fast as the underlying PostgreSQL query. Check:

EXPLAIN ANALYZE SELECT * FROM va_analytics_sales WHERE ...;

Missing indexes on filter columns are the most common cause of slow analytics queries.
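
As an illustration, if dashboards filter by region and date (the variables from the predicate-pushdown example), a multicolumn index on the underlying fact table is the usual fix. The column names here are assumptions, not part of the documented schema:

```sql
-- Hypothetical index covering the region/date filters on the fact table
CREATE INDEX idx_tf_sales_region_date ON tf_sales (region, sale_date);
```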

Large memory usage

Arrow Flight streams data in record batches. If client memory is a concern, process batches one at a time using the chunked reading pattern instead of reader.read_all(). See the “Reading Large Datasets” section above for an example.

Related guides:

  • Performance Guide — Query optimization and indexes for analytics views
  • Caching — Cache heavy aggregations
  • Federation — Combine analytics from multiple databases