
feat: add base64 expression #4159

Open
andygrove wants to merge 1 commit into apache:main from andygrove:add-base64-expression

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #419.

Rationale for this change

base64 is a commonly used Spark string function. The expression coverage doc previously listed it as unsupported, so queries using it fell back to Spark.

What changes are included in this PR?

  • New native function spark_base64 in native/spark-expr/src/string_funcs/base64.rs that produces padded RFC 4648 base64 (no line breaks). Wired into create_comet_physical_fun as "base64".
  • New Scala serdes in spark/src/main/scala/org/apache/comet/serde/strings.scala:
    • CometBase64 for the Spark 3.4 case-class shape (Base64(child)). Always returns Incompatible because Spark 3.4 always chunks the output.
    • CometBase64StaticInvoke for the Spark 3.5+ shape, where Base64 is RuntimeReplaceable and arrives as StaticInvoke(classOf[Base64], "encode", Seq(child, Literal(chunkBase64))). Returns Compatible only when the literal chunkBase64 is false; otherwise Incompatible.
  • CometStaticInvoke now delegates getSupportLevel and getExprConfigName to its inner handler so the Base64-specific support level and config name (spark.comet.expr.Base64.allowIncompatible) take effect through the StaticInvoke dispatch path.
  • Comet SQL Tests:
    • spark/src/test/resources/sql-tests/expressions/string/base64.sql covers binary and string columns, literals, NULL, empty input, the SPARK-47307 58-byte chunking boundary, a 200-byte input, and the full 0x00..0xFF byte range.
    • spark/src/test/resources/sql-tests/expressions/string/base64_chunked_fallback.sql asserts that on Spark 3.5+ Comet falls back to Spark when spark.sql.chunkBase64String.enabled=true and incompatible expressions have not been opted in.
  • Coverage doc docs/source/contributor-guide/spark_expressions_support.md updated with audit annotations for Spark 3.4.3 / 3.5.8 / 4.0.1.
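As a rough sketch of the output format the native function targets — padded RFC 4648 base64 with no line breaks — here is a std-only Rust encoder. This is illustrative only; the actual `spark_base64` kernel in Comet operates on Arrow arrays and is not reproduced here.

```rust
/// Encode bytes as padded RFC 4648 base64 with no line breaks
/// (the unchunked form Comet produces). Illustrative sketch, not
/// the actual Comet implementation.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to 3 input bytes into a 24-bit group.
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = (b[0] as u32) << 16 | (b[1] as u32) << 8 | b[2] as u32;
        let idx = [(n >> 18) & 63, (n >> 12) & 63, (n >> 6) & 63, n & 63];
        for (i, &x) in idx.iter().enumerate() {
            if i <= chunk.len() {
                out.push(ALPHABET[x as usize] as char);
            } else {
                out.push('='); // pad the final group to 4 characters
            }
        }
    }
    out
}

fn main() {
    assert_eq!(base64_encode(b""), "");
    assert_eq!(base64_encode(b"Man"), "TWFu"); // no padding needed
    assert_eq!(base64_encode(b"Ma"), "TWE=");  // one pad char
    assert_eq!(base64_encode(b"M"), "TQ==");   // two pad chars
    println!("ok");
}
```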

This change was scaffolded with the implement-comet-expression Claude skill and the resulting implementation was reviewed with the audit-comet-expression skill.

How are these changes tested?

  • New Comet SQL Tests under spark/src/test/resources/sql-tests/expressions/string/ cover both the compatible (chunkBase64String.enabled=false) and the fallback (chunkBase64String.enabled=true) paths.
  • New Rust unit tests in native/spark-expr/src/string_funcs/base64.rs cover array, scalar, NULL, and padding cases.
  • make format and cargo clippy --all-targets --workspace -- -D warnings pass locally, as do the targeted CometSqlFileTestSuite runs.
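The 58-byte boundary exercised by the SQL tests follows from the chunking arithmetic: MIME-style chunked base64 breaks lines at 76 characters, and 57 input bytes encode to exactly ceil(57/3) × 4 = 76 characters, so 58 bytes is the smallest input where chunked and unchunked output can diverge. A hedged sketch (the chunker here is illustrative; the exact separator and trailing-break behavior of Spark's chunked mode depends on the underlying codec):

```rust
// Illustrative: insert "\r\n" every 76 characters, as MIME-style
// chunked base64 does. Not Spark's actual implementation.
fn chunk76(encoded: &str) -> String {
    encoded
        .as_bytes()
        .chunks(76)
        .map(|c| std::str::from_utf8(c).unwrap())
        .collect::<Vec<_>>()
        .join("\r\n")
}

fn main() {
    // 57 input bytes encode to exactly 76 chars: chunking is a no-op.
    let enc57 = "A".repeat(76);
    assert_eq!(chunk76(&enc57), enc57);
    // 58 input bytes encode to 80 chars: chunked output now differs.
    let enc58 = "A".repeat(80);
    assert!(chunk76(&enc58).contains("\r\n"));
    println!("boundary demonstrated");
}
```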

Implement Spark base64 in Comet. Output is unchunked padded base64; the
expression is marked Compatible only when chunkBase64=false (Spark 3.5+
with `spark.sql.chunkBase64String.enabled=false`), and Incompatible
otherwise so users opt in via `spark.comet.expr.allowIncompatible=true`.

Closes apache#419
@andygrove force-pushed the add-base64-expression branch from 8df5a35 to 0f00230 on April 30, 2026 at 13:58

