[spark][doc] Add Spark batch union read #3142
Conversation
@wuchong @YannByron @luoyuxia please take a look, thank you!
> The union read works for both **log tables** and **primary key tables**:
>
> - **Log tables**: Combines Fluss log data with lake historical data
> - **Primary key tables**: Merges the latest Fluss snapshot with log changes and lake history to provide the most up-to-date view
Suggested wording: "Combines lake snapshot data with recent KV log changes using sort-merge to provide the most up-to-date view."
The phrase "latest Fluss snapshot" may cause confusion, as Fluss has its own internal snapshot concept (used for KV compaction).
```sql
-- Returns complete view combining Fluss and lake data
SELECT * FROM fluss_order_with_lake ORDER BY order_key;
```
Maybe we could add a note:

> Union read requires `scan.startup.mode = full` (default). Non-FULL modes (e.g., `earliest`, `latest`) bypass the lake path and read only from Fluss.
`scan.startup.mode` is actually not used in batch read; will fix the related code in another PR.
Thanks for the PR! Overall LGTM, with a few minor comments.
Pull request overview
Updates Fluss website documentation to reflect Spark batch support for lake-enabled “union read” and to surface that Spark is now a supported engine for union reads.
Changes:
- Update Lakehouse overview to state both Flink and Spark support union reads.
- Expand Spark “Reads” docs with a new section describing union reads for lake-enabled tables.
- Update Spark “Getting Started” feature matrix note to mention union read support.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| website/docs/streaming-lakehouse/overview.md | Updates union read engine support statement to include Spark. |
| website/docs/engine-spark/reads.md | Removes old limitation and adds documentation + examples for Spark batch union reads on lake-enabled tables. |
| website/docs/engine-spark/getting-started.md | Updates feature support note for batch select to mention union read. |
```sql title="Spark SQL"
-- Query will union data from Fluss and lake
SELECT SUM(total_amount) AS total_revenue FROM fluss_order_with_lake;
```
The example query uses total_amount, but the table created below defines total_price (and no total_amount). As written, this SQL will fail when users try it; update the column name in the query (or the example schema) so they match.
```suggestion
SELECT SUM(total_price) AS total_revenue FROM fluss_order_with_lake;
```
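For readers trying the doc snippet end to end, a self-consistent version would pair the corrected query with a schema that actually defines `total_price`. The sketch below is only illustrative; the column types and the lake-enable property are assumptions, not taken from the PR:

```sql title="Spark SQL"
-- Hypothetical lake-enabled table whose schema matches the corrected query
-- (column types and table properties here are illustrative assumptions)
CREATE TABLE fluss_order_with_lake (
  order_key BIGINT,
  total_price DECIMAL(10, 2)
) TBLPROPERTIES ('table.datalake.enabled' = 'true');

-- Union read: aggregates over both Fluss and lake data
SELECT SUM(total_price) AS total_revenue FROM fluss_order_with_lake;
```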
> #### Union Read
>
> To read the full dataset, simply query the table without any suffix. The Spark connector automatically unions data from Fluss and the lake storage:
This section says to query the table "without any suffix", but this page doesn't introduce any table suffix concept for Spark (unlike some Flink docs). Consider rephrasing to "query the table directly" or explicitly documenting what suffixes (if any) are supported for Spark reads and what they do.
```suggestion
To read the full dataset, simply query the table directly. The Spark connector automatically unions data from Fluss and the lake storage:
```
```diff
 | [SQL Add Partition](ddl.md#add-partition) | ✔️ | |
 | [SQL Drop Partition](ddl.md#drop-partition) | ✔️ | |
-| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table |
+| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table with union read |
```
In the feature table note, "Log table and primary-key table with union read" is ambiguous (it reads like only the primary-key table has union read). Consider rewording to make it clear that union read is supported for both table types when lake-enabled.
```suggestion
| [SQL Select (Batch)](reads.md) | ✔️ | Log table and primary-key table; both support union read when lake-enabled |
```
Purpose
Linked issue: close #xxx
Brief change log
Tests
API and Format
Documentation