Apache Spark

Apache Spark is the default execution engine for distributed data processing in Ilum. It runs on Kubernetes (with native CRD-based pod orchestration) or Apache Hadoop YARN, and is exposed through batch jobs, interactive sessions, in-app SQL notebooks, and the Apache Kyuubi SQL gateway.

Ilum bundles Apache Spark 4.x by default, with Spark 3.x available for legacy workloads.

When to use Spark

Spark is the right engine for:

  • Large-scale ETL and data transformation pipelines.
  • Machine learning workloads using Spark ML or MLlib.
  • Complex joins and aggregations across large datasets.
  • Streaming workloads with Spark Structured Streaming.
  • Workloads that benefit from horizontal scaling across many executors.

For interactive analytics on medium-to-large data, consider Trino. For small-data and local execution, consider DuckDB. For low-latency stream processing, consider Apache Flink.

Execution model

Spark runs as a driver and a configurable number of executors:

  • Driver pod: One per job. Coordinates execution, holds the Spark session, and tracks task state.
  • Executor pods: Provisioned dynamically based on workload. Run individual tasks in parallel and hold cached data.

Ilum manages the full pod lifecycle, including image selection, resource limits, dynamic allocation, and cleanup on completion.
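To make the model concrete, the sketch below shows how a driver and executors could be requested directly with spark-submit against Kubernetes. Ilum performs the equivalent submission for you; the API server address, namespace, image, and application path here are all placeholders, not Ilum defaults.

```shell
# Hypothetical manual submission; Ilum issues the equivalent internally.
# All names, addresses, and the image tag below are placeholders.
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --name etl-example \
  --conf spark.kubernetes.namespace=ilum \
  --conf spark.kubernetes.container.image=spark:4.0.0 \
  --conf spark.driver.memory=2g \
  --conf spark.executor.memory=4g \
  --conf spark.executor.instances=3 \
  local:///opt/spark/examples/src/main/python/pi.py
```

One driver pod is created per submission; the executor count here is static, whereas Ilum's defaults enable dynamic allocation (see the Configuration section below).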

Workload types

Spark powers four kinds of workloads in Ilum:

  • Jobs: One-shot batch executions.
  • Services: Long-running interactive Spark sessions that execute code on demand without per-call initialization overhead.
  • Schedules: Cron-driven recurring jobs.
  • Requests: Ad-hoc submissions through the REST API or UI.

All four are managed through the Workloads section of the Ilum UI.

Supported catalogs

Spark connects to all four Ilum catalogs.

Supported table formats

Spark reads and writes:

  • Delta Lake: ACID transactions, time travel, schema evolution.
  • Apache Iceberg: Partition evolution, hidden partitioning.
  • Apache Hudi: Record-level upserts, incremental processing.
  • Parquet, ORC, CSV, JSON, Avro: Standard file formats.

The Ilum Tables abstraction lets you read and write Delta, Iceberg, and Hudi using the same Spark API.
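With plain Spark APIs, switching between these table formats largely comes down to the `format` string passed to the reader and writer, which is the idea the Ilum Tables abstraction builds on. A minimal sketch, assuming a running Spark session with the relevant connectors on the classpath; the bucket paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

df = spark.range(100).withColumnRenamed("id", "user_id")

# Same DataFrameWriter API, different table formats.
df.write.format("delta").mode("overwrite").save("s3a://bucket/tables/users_delta")
df.write.format("parquet").mode("overwrite").save("s3a://bucket/tables/users_parquet")

# Reading back is symmetric: only the format string changes.
delta_df = spark.read.format("delta").load("s3a://bucket/tables/users_delta")
```

Writing Delta (and likewise Iceberg or Hudi) requires the corresponding connector to be present; Ilum's bundled Spark images are expected to provide these, but verify for custom images.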

Configuration

Spark configuration is managed through Helm values and per-cluster settings:

ilum-core:
  spark:
    enabled: true
    cluster:
      defaults:
        spark.dynamicAllocation.enabled: "true"
        spark.dynamicAllocation.minExecutors: "1"
        spark.dynamicAllocation.maxExecutors: "20"
        spark.dynamicAllocation.executorIdleTimeout: "60s"

Per-cluster overrides are configured in the Workloads > Clusters UI and apply to all Spark jobs targeting that cluster.

Spark Connect

Spark Connect provides a client-server architecture for remote Spark execution. Ilum deploys Spark Connect servers as standard jobs and includes a Kubernetes-aware proxy that allows Spark Connect endpoints to be reached across cluster boundaries.

Refer to Spark Connect for details.
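As an illustration, a client can attach to a Spark Connect endpoint with the standard PySpark remote builder (available in pyspark 3.4+ with the Connect extras installed). The host and port below are placeholders for the endpoint exposed through Ilum's proxy:

```python
from pyspark.sql import SparkSession

# Attach to a remote Spark Connect server; no local JVM is started.
# The sc:// address is a placeholder for the endpoint Ilum exposes.
spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()

df = spark.sql("SELECT 1 AS ok")
df.show()
```

Because execution happens server-side, thin clients (notebooks, CI jobs, applications) can share a long-running Spark Connect service without carrying Spark's full runtime dependencies.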

Submitting a Spark job

For a step-by-step walkthrough, refer to Run a simple Spark job.