
Fix parquet compression handling in aws_s3 sink #37

Merged

sundaresanr merged 1 commit into v0.54.0-exaforce from parquet-compression on Apr 21, 2026

Conversation


@sundaresanr sundaresanr commented Apr 21, 2026

Parquet sinks were producing unreadable files unless users explicitly set
`compression = none`, due to two issues:

  • Sink-level compression (default: gzip) wrapped parquet output,
    causing S3 objects to start with gzip magic bytes instead of PAR1.
    This led to ingestion failures in systems like DuckDB and Snowflake.
  • Parquet writer always used UNCOMPRESSED, and the parquet crate was
    built without compression features (snappy, flate2, zstd), causing
    runtime panics when other compression types were attempted.
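The first failure mode is visible from the object's leading magic bytes: a parquet file begins with the ASCII marker `PAR1`, while a gzip stream begins with `0x1f 0x8b`. An illustrative sketch (not code from the PR) of how a reader tells the two apart:

```rust
/// Illustrative only: distinguish a raw parquet object from one that was
/// wrapped by transport-level gzip, by inspecting the leading magic bytes.
fn looks_like_parquet(bytes: &[u8]) -> bool {
    // Parquet files start (and end) with the ASCII marker "PAR1".
    bytes.starts_with(b"PAR1")
}

fn looks_like_gzip(bytes: &[u8]) -> bool {
    // Gzip streams start with the two-byte magic 0x1f 0x8b.
    bytes.starts_with(&[0x1f, 0x8b])
}
```

A gzip-wrapped parquet object passes the gzip check and fails the parquet one, which is why DuckDB and Snowflake rejected the files.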

Fix:

  • For codec = parquet, pass sink-level compression into parquet
    WriterProperties (internal compression).
  • Force transport-layer compression to None to avoid double wrapping.
  • Leave behavior unchanged for non-parquet codecs.
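The three points above amount to a single routing decision. A minimal sketch, where `Codec` and `Compression` are hypothetical stand-ins for the sink's real configuration types and the first returned value is what the real change feeds into parquet's `WriterProperties`:

```rust
/// Hypothetical stand-ins for the sink's config types (not Vector's
/// actual identifiers).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Codec { Parquet, Json }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Compression { None, Gzip }

/// Returns (parquet-internal compression, transport-layer compression).
/// For parquet, the configured compression moves inside the file and the
/// transport layer is forced to None so the S3 object begins with PAR1.
/// Other codecs keep the old behavior unchanged.
fn split_compression(codec: Codec, configured: Compression) -> (Option<Compression>, Compression) {
    match codec {
        Codec::Parquet => (Some(configured), Compression::None),
        _ => (None, configured),
    }
}
```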

Compression mapping:

  • None -> UNCOMPRESSED
  • Snappy -> SNAPPY
  • Gzip -> GZIP
  • Zstd -> ZSTD
  • Zlib -> rejected at build time (no parquet equivalent)
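The mapping above can be sketched as a total function over the sink's compression options. The enum names here are hypothetical stand-ins mirroring the PR description, not the exact identifiers in Vector or the parquet crate:

```rust
/// Hypothetical stand-ins for the sink-level setting and the parquet
/// writer's codec enum; names mirror the PR description.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SinkCompression { None, Snappy, Gzip, Zstd, Zlib }

#[derive(Clone, Copy, Debug, PartialEq)]
enum ParquetCompression { Uncompressed, Snappy, Gzip, Zstd }

/// Map the sink-level setting onto parquet's internal codec, rejecting
/// zlib when the sink is built since parquet has no equivalent.
fn to_parquet_compression(c: SinkCompression) -> Result<ParquetCompression, String> {
    match c {
        SinkCompression::None => Ok(ParquetCompression::Uncompressed),
        SinkCompression::Snappy => Ok(ParquetCompression::Snappy),
        SinkCompression::Gzip => Ok(ParquetCompression::Gzip),
        SinkCompression::Zstd => Ok(ParquetCompression::Zstd),
        SinkCompression::Zlib => Err("`zlib` compression has no parquet equivalent".to_string()),
    }
}
```

Rejecting zlib at build time surfaces the misconfiguration immediately instead of panicking in the writer at runtime.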


Also fixes a broken test import (vrl::value::btreemap was made private)
so the parquet test module actually compiles.
sundaresanr force-pushed the parquet-compression branch from 62ac478 to 5825924 on April 21, 2026 03:38
sundaresanr merged commit 2626c73 into v0.54.0-exaforce on Apr 21, 2026
3 checks passed
