[spark][doc] Add Spark batch union read #3142
Conversation
@wuchong @YannByron @luoyuxia please take a look, thank you!
> The union read works for both **log tables** and **primary key tables**:
>
> - **Log tables**: Combines Fluss log data with lake historical data
> - **Primary key tables**: Merges the latest Fluss snapshot with log changes and lake history to provide the most up-to-date view
Suggested wording: "Combines lake snapshot data with recent KV log changes using sort-merge to provide the most up-to-date view."
The phrase "latest Fluss snapshot" may cause confusion, as Fluss has its own internal snapshot concept (used for KV compaction).
```sql
-- Returns complete view combining Fluss and lake data
SELECT * FROM fluss_order_with_lake ORDER BY order_key;
```
Maybe we could add a note:

> Union read requires `scan.startup.mode = full` (default). Non-FULL modes (e.g., `earliest`, `latest`) bypass the lake path and read only from Fluss.
`scan.startup.mode` is actually not used in batch read; will fix the related code in another PR.
Thanks for the PR! Overall LGTM, with a few minor comments.
Pull request overview
Updates Fluss website documentation to reflect Spark batch support for lake-enabled “union read” and to surface that Spark is now a supported engine for union reads.
Changes:
- Update Lakehouse overview to state both Flink and Spark support union reads.
- Expand Spark “Reads” docs with a new section describing union reads for lake-enabled tables.
- Update Spark “Getting Started” feature matrix note to mention union read support.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| website/docs/streaming-lakehouse/overview.md | Updates union read engine support statement to include Spark. |
| website/docs/engine-spark/reads.md | Removes old limitation and adds documentation + examples for Spark batch union reads on lake-enabled tables. |
| website/docs/engine-spark/getting-started.md | Updates feature support note for batch select to mention union read. |
```sql title="Spark SQL"
-- Query will union data from Fluss and lake
SELECT SUM(total_amount) AS total_revenue FROM fluss_order_with_lake;
```
The example query uses total_amount, but the table created below defines total_price (and no total_amount). As written, this SQL will fail when users try it; update the column name in the query (or the example schema) so they match.
```suggestion
SELECT SUM(total_price) AS total_revenue FROM fluss_order_with_lake;
```
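For readers trying the doc snippet end to end, a self-consistent version would pair the corrected query with a schema that actually defines `total_price`. The sketch below is only illustrative; the column types and the lake-enable property are assumptions, not taken from the PR:

```sql title="Spark SQL"
-- Hypothetical lake-enabled table whose schema matches the corrected query
-- (column types and table properties here are illustrative assumptions)
CREATE TABLE fluss_order_with_lake (
  order_key BIGINT,
  total_price DECIMAL(10, 2)
) TBLPROPERTIES ('table.datalake.enabled' = 'true');

-- Union read: aggregates over both Fluss and lake data
SELECT SUM(total_price) AS total_revenue FROM fluss_order_with_lake;
```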
> #### Union Read
>
> To read the full dataset, simply query the table without any suffix. The Spark connector automatically unions data from Fluss and the lake storage:
This section says to query the table "without any suffix", but this page doesn't introduce any table suffix concept for Spark (unlike some Flink docs). Consider rephrasing to "query the table directly" or explicitly documenting what suffixes (if any) are supported for Spark reads and what they do.
```suggestion
To read the full dataset, simply query the table directly. The Spark connector automatically unions data from Fluss and the lake storage:
```
```diff
 | [SQL Add Partition](ddl.md#add-partition) | ✔️ | |
 | [SQL Drop Partition](ddl.md#drop-partition) | ✔️ | |
-| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table |
+| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table with union read |
```
In the feature table note, "Log table and primary-key table with union read" is ambiguous (it reads like only the primary-key table has union read). Consider rewording to make it clear that union read is supported for both table types when lake-enabled.
```suggestion
| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table; both support union read when lake-enabled |
```
Purpose
Linked issue: close #xxx
Brief change log
Tests
API and Format
Documentation