GH-45946: [C++][Parquet] Variant decoding#50121
Conversation
| /// Searches the field IDs in the object, resolving each against the | ||
| /// metadata dictionary. Per spec, field IDs are in lexicographic order | ||
| /// of their corresponding key names, enabling binary search for large | ||
| /// objects (>=32 fields). For smaller objects, linear scan is used. |
There was a problem hiding this comment.
How was the 32 threshold determined?
|
|
||
| /// \brief Basic type codes from bits 0-1 of the value header byte. | ||
| /// | ||
| /// Variant Encoding Spec §3: "Value encoding" |
There was a problem hiding this comment.
nit: The current version of the spec does not contain paragraph §3 and §3.1. I would just add a link to the section with tables: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
| /// Implements parsing logic per the Variant Encoding Spec: | ||
| /// https://github.com/apache/parquet-format/blob/master/VariantEncoding.md | ||
| /// | ||
| /// The "internal" in the filename refers to the binary encoding internals |
There was a problem hiding this comment.
I don't have a strong opinion here. But maybe instead of explaining in the comment what "internal" means it would be better to rename a file e.g. to variant_binary_encoding, variant_internal_encoding etc.
|
|
||
| class VariantIntegrationTest : public ::testing::Test {}; | ||
|
|
||
| TEST_F(VariantIntegrationTest, FullRoundTrip) { |
There was a problem hiding this comment.
I would also add more tests demonstrating how to use all these new functions together. For example, let's assume we have the following Variant:
{
"name": "Alice",
"age": 30,
"addresses": {
"postal": {
"country": "USA",
"city": "New York"
},
"billing": {
"country": "USA",
"city": "Chicago"
}
}
}
If we want to find the city for the postal address, we would first need to use FindObjectField to find "addresses", then "postal", and finally "city". After that, we would read the value of the "city" field.
| /// the visitor. Returns the number of bytes consumed. | ||
| /// | ||
| /// This is the core recursive function. | ||
| Status DecodeValueAt(const VariantMetadata& metadata, const uint8_t* data, int64_t length, |
There was a problem hiding this comment.
I think this function should be public. Let's assume I want to read the value of a specific nested field from a Variant using a path (e.g., field_1.field_2.field_3).
My current understanding is that I would first need to call FindObjectField to locate "field_1". If it exists, I then have to find "field_2", and finally "field_3". However, I have to implement the last step—reading the actual value—on my own because DecodeValueAt is not public, and DecodeVariantValue only allows for decoding the entire Variant.
| /// \return Status::OK on success, Status::Invalid on malformed input | ||
| /// | ||
| /// \note The data buffer must remain valid for the duration of the call. | ||
| ARROW_EXPORT Status DecodeVariantValue(const VariantMetadata& metadata, |
There was a problem hiding this comment.
Do you have a plan to also support reading/decoding shredded variants?
Rationale for this change
This is part of the GH-45937 umbrella (Add variant support to C++ Parquet). The Variant Encoding Spec defines a binary format for semi-structured data in Parquet. This PR adds the decoding (reading) side, which is a prerequisite for the encoder (GH-45947, PR #50122) and shredding support (GH-45948, PR #50232).
The implementation targets feature parity with the existing arrow-go
parquet/variantpackage, adapted to idiomatic C++ patterns. Divergences from Go are deliberate and documented in code comments.What changes are included in this PR?
Full Variant binary decoding per the Variant Encoding Spec. Adds
variant_internal.h/.ccwith:Decoder (visitor/SAX-style traversal):
DecodeMetadata()parses the string dictionary from raw bytesDecodeVariantValue()recursive traversal invoking aVariantVisitorfor each valuekMaxNestingDepth = 128) security hardening for C++ stack semantics (Go doesn't need this due to goroutine stack growth)Random-access utilities (for future Parquet reader integration):
ValueSize()compute byte size of a value without full decodeFindObjectField()lookup by field name (binary search for 32 fields, linear for small objects)GetArrayElement()O(1) element access by indexGetObjectFieldAt()positional field accessFindMetadataKey()dictionary ID lookup (binary search if sorted)Design choices (deliberate divergences from Go documented in code):
TypeVisitor,ArrayVisitorprecedent)FindObjectFieldbinary search uses signedint32_tforlo/hito avoid an unsigned underflow pattern present in Go'sObjectValue.ValueByKey()(separate bug report TBD)Bug discovered in arrow-go during development:
valueSize()inparquet/variant/utils.goreads the wrong bit for arrayis_largeuses(typeInfo >> 4) & 0x1(object's is_large position) instead of(typeInfo >> 2) & 0x1. Fix submitted as [Go][Parquet] valuesize() uses incorrect bit position for basicarray large flag arrow-go#839.ObjectValue.ValueByKey()binary search usesj = mid - 1wherejisuint32wraps toMaxUint32whenmid == 0, skipping elements. Reported as [Go][Parquet] ObjectValue.ValueByKey binary search skips elements at mid - 1 arrow-go#842.Are these changes tested?
Yes. 165 tests pass with
BUILD_WARNING_LEVEL=CHECKIN(warnings-as-errors):Are there any user-facing changes?
No breaking changes. This adds new public API (
arrow::extension::variant_internalnamespace) that did not previously exist. The headervariant_internal.his installed "internal" in the filename refers to "binary encoding internals" not visibility.Next in chain: The encoder PR (#50122) is stacked on this branch and should be reviewed/merged after this one. The shredding PR (#50232) depends on both.
AI Disclosure: AI coding assistants were used during development for scaffolding, test generation, and review iteration. All code has been reviewed, debugged, and verified by the author who owns and understands the changes.