
Sail

Author: lakehq

Description: Sail (by LakeSail) is an open-source, Rust-native distributed compute engine compatible with the Spark Connect protocol (Spark SQL + DataFrame API). It provides a server that PySpark can connect to via `sc://host:port` with no code rewrites, and targets unified batch, streaming, and AI/compute-intensive workloads.

Stars: 1.2k

Forks: 82

License: Apache License 2.0

Category: Open Source


## Installation

### Quick Start (Python)

1. Install Sail (PyPI package):

   ```bash
   pip install pysail
   ```

2. Install Spark Connect client support:

   ```bash
   pip install "pyspark[connect]"
   ```
### Advanced / Production
- Install from source (for hardware-specific optimization): follow the documented Installation Guide: [https://docs.lakesail.com/sail/latest/introduction/installation/](https://docs.lakesail.com/sail/latest/introduction/installation/)
- Deploy on Kubernetes (cluster mode):
  1. Apply the manifest: `kubectl apply -f sail.yaml`
  2. Port-forward the Spark server: `kubectl -n sail port-forward service/sail-spark-server 50051:50051` (see the Kubernetes Deployment Guide: [https://docs.lakesail.com/sail/latest/guide/deployment/kubernetes.html](https://docs.lakesail.com/sail/latest/guide/deployment/kubernetes.html))
### Community
- Slack community link (from README): [https://www.launchpass.com/lakesail-community/free](https://www.launchpass.com/lakesail-community/free)

01

```bash
pip install pysail
```

Install the Sail Python package from PyPI.

02

```bash
pip install "pyspark[connect]"
```

Install PySpark with Spark Connect support (the client components needed to connect to Sail).

03

```bash
sail spark server --port 50051
```

Start a local Sail Spark server on the specified port via the Sail CLI.

04

```python
from pysail.spark import SparkConnectServer

server = SparkConnectServer(port=50051)
```

Create a Spark Connect server instance in Python, configured to listen on the given port.

05

```python
server.start(background=False)
```

Start the Spark Connect server; with `background=False` it runs in the foreground and blocks the calling process.

06

```bash
kubectl apply -f sail.yaml
```

Deploy Sail to a Kubernetes cluster using the provided manifest.

07

```bash
kubectl -n sail port-forward service/sail-spark-server 50051:50051
```

Port-forward the Sail Spark server service from Kubernetes to localhost for client connections.

08

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
```

Connect a SparkSession to a remote Spark Connect endpoint (Sail) and create or reuse the session.

09

```python
spark.sql("SELECT 1 + 1").show()
```

Execute a SQL query through the connected SparkSession and display the result.

FAQs

How do I migrate existing PySpark code to Sail without rewriting my DataFrame logic?

Change only your SparkSession creation to use a Spark Connect connection string pointing at Sail's gRPC server, typically `sc://localhost:50051` or your deployed endpoint. Your existing DataFrame operations, transformations, and SQL queries remain unchanged because Sail implements the Spark Connect protocol. Test with your actual workload first, as incompatible operations surface at runtime, not during refactoring.
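To illustrate the scale of the change, only the session construction differs; `sail_remote_url` below is a hypothetical helper (not part of Sail or PySpark) that merely formats the `sc://` connection string:

```python
# Hypothetical helper (not part of Sail or PySpark): build the Spark Connect
# URL that a Sail gRPC server listens on.
def sail_remote_url(host: str, port: int = 50051) -> str:
    return f"sc://{host}:{port}"

# Before (classic PySpark):
#   spark = SparkSession.builder.master("local[*]").getOrCreate()
# After (Sail via Spark Connect), with `pyspark[connect]` installed:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(sail_remote_url("localhost")).getOrCreate()
# All DataFrame and SQL code below this line stays unchanged.
```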

What Spark SQL features are not yet supported by Sail, and how can I check compatibility for my specific queries?

LakeSail maintains a compatibility matrix in their GitHub repo tracking unsupported functions and edge cases. Test your specific queries against a Sail instance and monitor for fallback warnings. The project documents breaking changes in release notes, so subscribe to releases. Start with non-critical queries and incrementally expand coverage based on actual execution results.
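One way to test specific queries incrementally is a small smoke-test harness. This is a hypothetical sketch, not a Sail API; `run_query` stands in for something like `lambda q: spark.sql(q).collect()` against a Sail session:

```python
# Hypothetical compatibility smoke test (not a Sail API): run each candidate
# query through `run_query` and record which ones the engine rejects.
def check_compatibility(queries, run_query):
    report = {}
    for q in queries:
        try:
            run_query(q)
            report[q] = "ok"
        except Exception as exc:  # unsupported features surface at runtime
            report[q] = f"failed: {exc}"
    return report
```

Run it against non-critical queries first, then widen coverage as results come back clean.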

How do I set up Sail's MCP server to query Parquet files stored in S3 from Claude Desktop?

After configuring Claude Desktop with Sail's MCP server, prompt Claude to use create_parquet_view with your S3 path. Ensure AWS credentials are configured via environment variables, CLI profiles, or IAM roles for S3 authentication. Once registered, request analysis in natural language and Claude generates and executes the Spark SQL query.
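For the environment-variable option, the standard AWS SDK variables apply; the values below are placeholders:

```shell
# Standard AWS environment variables (placeholder values); S3-capable clients
# pick these up for authentication.
export AWS_ACCESS_KEY_ID="AKIA...your-key-id"
export AWS_SECRET_ACCESS_KEY="...your-secret-key"
export AWS_REGION="us-east-1"   # example region; use your bucket's region
```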

How does Sail's performance compare to Apache Spark on real-world workloads beyond the vendor-provided TPC-H benchmarks?

Real-world performance comparisons are scarce because Sail remains pre-v1.0 with limited production adoption. Independent benchmarks haven't emerged yet. Performance varies significantly based on query complexity, data skew, network latency in distributed setups, and which Spark SQL features you use—unsupported operations force fallbacks that eliminate gains. Run your actual workload on representative data volumes and infrastructure to measure meaningful differences.
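When you do run your own workload, even a minimal timing harness beats eyeballing results. This hypothetical helper (not part of Sail) takes any zero-argument callable, e.g. `lambda: spark.sql(query).collect()`, and reports the best of several runs:

```python
import time

# Hypothetical benchmarking helper (not part of Sail): time a callable and
# return the fastest of `repeats` runs, in seconds.
def best_of(run, repeats=3):
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return min(samples)
```

Best-of-N damps warm-up and network jitter; compare the same callable against Sail and against Spark on identical data and infrastructure.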

What security precautions should I take when using Sail's MCP server to prevent accidental data modification?

Use read-only database credentials when connecting Sail to remote data sources like S3, Azure, GCS, or HDFS. Manually review every SQL query Claude generates before approving execution through the MCP tool interface. Consider running Sail in an isolated environment with separate credentials that lack write, update, or delete permissions on production datasets to create an additional safety boundary.
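For S3 specifically, credentials that lack write, update, and delete permissions can be expressed as a standard IAM policy; the bucket name below is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyAnalytics",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-analytics-bucket",
        "arn:aws:s3:::my-analytics-bucket/*"
      ]
    }
  ]
}
```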

Can Sail replace Apache Spark for distributed production workloads on Kubernetes, or is it only suitable for testing and development?

Sail supports distributed Kubernetes deployment and exposes production features like OpenTelemetry tracing, but it remains pre-v1.0 with incomplete Spark SQL coverage and no production-ready streaming support. It's viable for non-streaming batch analytics where teams accept early-adopter risk, API changes between releases, and vendor-reported benchmarks, but Apache Spark remains necessary for full feature coverage and production SLAs today.

Updated 3/18/2026