Vai al contenuto principale

Execution Engines

Ilum is a true multi-engine data lakehouse. SQL queries and data workloads can be executed against any of four engines, all sharing the same catalogs, table formats, and object storage:

  • Apache Spark : distributed processing for large-scale ETL, machine learning, and complex transformations.
  • Trino : massively parallel processing for interactive analytics and federated queries.
  • DuckDB : single-node, embedded execution for small-to-medium data and DuckLake-managed tables.
  • Apache Flink: low-latency stream processing and continuous data pipelines.

All four engines are fronted by the Apache Kyuubi SQL gateway, which provides session management, JDBC/ODBC compatibility, and unified authentication. An automatic engine router observes incoming queries and selects the engine best suited to each workload, with explicit user override available at any time.

Architettura

┌──────────────────────────────────┐
│ SQL Editor / JDBC / ODBC / API │
└────────────────┬─────────────────┘

┌────────▼────────┐
│ Apache Kyuubi │ Session management, auth,
│ SQL Gateway │ connection pooling
│ + Auto Router │
└─┬───┬─────┬───┬─┘
│ │ │ │
┌──▼┐ ┌▼─┐ ┌─▼┐ ┌▼──────┐
│Spk│ │Tr│ │Dk│ │Flink │
│ a │ │in│ │DB│ │(Beta) │
│ rk│ │ o│ │ │ │ │
└─┬─┘ └─┬┘ └┬─┘ └─┬─────┘
│ │ │ │
┌─────▼─────▼───▼─────▼──────┐
│ Catalog Layer │
│ Hive · Nessie · Unity │
│ · DuckLake │
├────────────────────────────┤
│ Storage Layer │
│ MinIO / S3 / GCS / WASBS │
│ / HDFS │
└────────────────────────────┘

When to use each engine

WorkloadRecommended engineWhy
Large-scale ETL, batch transformsApache Spark Distributed shuffle, dynamic allocation, mature optimizer
Machine learning pipelinesApache Spark MLlib, Spark ML, GPU executor support
Interactive analytics on large datasetsTrino Pipelined MPP execution, sub-second latency
Federated queries across data sourcesTrino Connectors for relational, NoSQL, search, lake
Ad-hoc exploration on small dataDuckDB Zero pod startup, in-process execution
Local-first analytics over DuckLakeDuckDB Native catalog integration, fast local queries
Low-latency stream processingApache FlinkEvent-time semantics, continuous queries
Streaming ingestion with batch parityApache Spark Structured StreamingSame code path as batch jobs

Engines share the same data through the catalog layer and object storage, so workloads can shift from Spark to Trino to DuckDB without rewriting queries or copying data.

Automatic engine router

The router selects the engine for each incoming SQL query based on several signals:

  • Estimated scan size: Pulled from catalog statistics. Small scans favour DuckDB; large scans favour Spark or Trino.
  • Workload type: Streaming queries route to Flink; interactive read-only queries route to Trino; heavy joins and aggregations route to Spark.
  • Locality: Queries against DuckLake-managed tables favour DuckDB; queries against tables already cached or warmed in another engine prefer that engine.
  • Engine availability: Only engines that are deployed and currently healthy are considered.

The router is conservative: when signals are ambiguous, it falls back to a configurable default engine.

Manual override

Every query can override the router by selecting an explicit engine:

  • In the SQL Editor: Use the Engine Selector dropdown.
  • Through the REST API: Set the engine field on the POST /api/v1/sql/execute request.
  • Through JDBC / ODBC: Set a session property (for example, SET ilum.engine = 'trino').

User overrides always take precedence over router decisions.

Apache Kyuubi SQL Gateway

Kyuubi is the unified SQL gateway in front of Spark, Trino, and Flink. It provides:

  • Session management: Long-lived SQL sessions per user with isolation.
  • JDBC and ODBC compatibility: Standard interfaces for BI tools (Tableau, PowerBI, Superset).
  • Autenticazione : Integrated with Ilum's RBAC, OAuth2, and LDAP stack.
  • Connection pooling: Efficient resource usage across many concurrent queries.
  • Engine federation: A single entry point for queries that may target any registered engine.

DuckDB is integrated separately as an in-process engine inside ilum-core for the lowest possible latency on small queries.

Configurazione

Each engine is enabled and configured through Helm values:

# Spark (always available through ilum-core)
ilum-core :
scintilla :
Abilitato : vero

# Trino
Trino :
Abilitato : vero

# DuckDB (default-on, with DuckLake catalog)
ilum-core :
SQL :
duckdb:
ducklake:
Abilitato : vero

# Kyuubi SQL gateway
ilum-sql :
Abilitato : vero
ilum-core :
SQL :
Abilitato : vero

Apache Flink is enabled per Enterprise deployment. Refer to the Flink page for details.

Per-engine documentation