fix(ipc): reject dictionary-encoded dictionary values by goutamadwant · Pull Request #10230 · apache/arrow-rs

goutamadwant · 2026-06-28T17:11:43Z

Which issue does this PR close?

Closes Buffer count mismatched with metadata when encoding records with dictionary of dictionaries #10213.

Rationale for this change

Arrow IPC dictionary batches write a dictionary field's values as the batch payload. When those values are themselves dictionary-encoded, the writer can produce IPC data that readers cannot decode, failing later with a buffer metadata mismatch.

What changes are included in this PR?

This adds schema validation to the IPC file and stream writer constructors so direct dictionary-of-dictionary schemas return a clear InvalidArgumentError before any IPC bytes are written. A low-level dictionary encoding guard is kept as a backstop.

A regression test covers the direct Dictionary(_, Dictionary(_, _)) case for both IPC stream and file writers.

Are these changes tested?

Yes.

cargo fmt --all -- --check
cargo test -p arrow-ipc
cargo test -p arrow-ipc --all-features
cargo clippy -p arrow-ipc --all-targets -- -D warnings
cargo clippy -p arrow-ipc --all-targets --all-features -- -D warnings

Are there any user-facing changes?

Yes. The IPC file and stream writers now return a clear error for direct dictionary-of-dictionary schemas instead of writing IPC data that fails during read. There are no public API changes.

thorfour

Thanks for this! This will be a helpful guard

thorfour · 2026-06-29T14:18:01Z

+                // the encoded buffers to the schema metadata.
+                if matches!(value_type.as_ref(), DataType::Dictionary(_, _)) {
+                    return Err(ArrowError::InvalidArgumentError(format!(
+                        "IPC does not support dictionary-encoded values for dictionary field {:?}",


I'm not overly familiar with the IPC spec but if it could support dict of dict encoding, can we update this error message to point to an issue about implementing it in the future?

Yeah, as far as I can tell, there is nothing that prevents dicts from dicts on IPC, and some forms of nested dicts are already supported, at least when they're not direct children (eg. dict(struct(dict))).

Looking at Schema.fbs, I think it actually is not capable of encoding a pure dict(dict)

arrow-rs/format/Schema.fbs

Lines 511 to 530 in 9b190c9

table Field {

/// Name is not required, in i.e. a List

name: string;

/// Whether or not this field can contain nulls. Should be true in general.

nullable: bool;

/// This is the type of the decoded value if the field is dictionary encoded.

type: Type;

/// Present only if the field is dictionary encoded.

dictionary: DictionaryEncoding;

/// children apply only to nested data types like Struct, List and Union. For

/// primitive types children will have length 0.

children: [ Field ];

/// User-defined metadata

custom_metadata: [ KeyValue ];

}

If we have Dictionary(i32, Dictionary(u32, Utf8)), then type here can't really be anything other than Utf8 since Type excludes a dictionary variant; and dictionary here doesn't seem to support nesting. So it seems physically incapable of encoding dict(dict(utf8)) and would always flatten to dict(utf8).

I tested on pyarrow 24.0.0 and it looks to be incapable of this roundtrip too:

import pyarrow as pa print(pa.__version__) inner = pa.array(["a", "b", "c"], pa.dictionary(pa.int64(), pa.utf8())) outer = pa.array(inner, pa.dictionary(pa.int64(), inner.type)) batch = pa.record_batch([outer], names=["c0"]) print(batch.schema) sink = pa.BufferOutputStream() with pa.ipc.new_stream(sink, batch.schema) as writer: for i in range(5): writer.write_batch(batch) buf = sink.getvalue() with pa.ipc.open_stream(buf) as reader: print(reader.schema) batches = [b for b in reader] print("Done")

~$ uv run --with pyarrow python3 -m nested_dict.py 24.0.0 c0: dictionary<values=dictionary<values=string, indices=int64, ordered=0>, indices=int64, ordered=0> c0: dictionary<values=string, indices=int64, ordered=0> Traceback (most recent call last): File "<frozen runpy>", line 189, in _run_module_as_main File "<frozen runpy>", line 112, in _get_module_details File "/Users/jeffrey/nested_dict.py", line 16, in <module> batches = [b for b in reader] ^^^^^^ File "pyarrow/ipc.pxi", line 757, in pyarrow.lib.RecordBatchReader.__next__ File "pyarrow/ipc.pxi", line 791, in pyarrow.lib.RecordBatchReader.read_next_batch File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status OSError: buffer_index out of range.

See how second print (for reader schema) shows its flattened from original schema

We could send a query to the arrow mailing list for clarification.

Thanks @Jefffrey makes sense. I updated the error message to frame this as an Arrow IPC field metadata limitation for direct dictionary-of-dictionary values.

The guard is still limited to direct dictionary-of-dictionary schemas, so existing indirect nested dictionary cases such as dict(struct(dict)) remain supported. I also removed the older buffer-matching explanation from the code. Let me know if any suggestions.

Jefffrey · 2026-06-30T02:54:06Z

+        let err = match crate::writer::StreamWriter::try_new(&mut stream, batch.schema_ref()) {
+            Ok(_) => panic!("IPC stream writer should reject dictionary-of-dictionary schemas"),
+            Err(err) => err,
+        };


Suggested change

let err = match crate::writer::StreamWriter::try_new(&mut stream, batch.schema_ref()) {

Ok(_) => panic!("IPC stream writer should reject dictionary-of-dictionary schemas"),

Err(err) => err,

};

let err = StreamWriter::try_new(&mut stream, batch.schema_ref()).unwrap_err();

same below

adjusted this to use let Err(err) = ... else { panic!(...) }. unwrap_err would require the successful writer type to implement Debug.

Jefffrey · 2026-06-30T02:55:09Z

+                // A dictionary batch writes this field's dictionary values as its payload.
+                // If that payload is itself dictionary-encoded, readers cannot match
+                // the encoded buffers to the schema metadata.


Suggested change

// A dictionary batch writes this field's dictionary values as its payload.

// If that payload is itself dictionary-encoded, readers cannot match

// the encoded buffers to the schema metadata.

We don't really need this detail; the underlying issue seems to be IPC format itself doesn't really support this

Done. removed the detailed comment and kept the explanation in the error message.

Jefffrey · 2026-06-30T02:55:46Z

+        .try_for_each(|field| ensure_supported_ipc_field(field))
+}
+
+fn ensure_supported_ipc_field(field: &Field) -> Result<(), ArrowError> {


nit: could inline this function

Done, inlined the helper.

Jefffrey · 2026-07-02T08:20:37Z

thanks @goutamadwant, @brancz & @thorfour

github-actions Bot added the arrow Changes to the arrow crate label Jun 28, 2026

goutamadwant mentioned this pull request Jun 28, 2026

Buffer count mismatched with metadata when encoding records with dictionary of dictionaries #10213

Closed

thorfour reviewed Jun 29, 2026

View reviewed changes

Jefffrey reviewed Jun 30, 2026

View reviewed changes

fix(ipc): reject dictionary-encoded dictionary values

0b6fd36

goutamadwant force-pushed the fix-ipc-dictionary-values branch from 0c6ab92 to 0b6fd36 Compare June 30, 2026 03:36

Merge branch 'main' into fix-ipc-dictionary-values

7a6771a

Jefffrey approved these changes Jul 1, 2026

View reviewed changes

Jefffrey merged commit d3d7ae8 into apache:main Jul 2, 2026
30 checks passed

	table Field {
	/// Name is not required, in i.e. a List
	name: string;

	/// Whether or not this field can contain nulls. Should be true in general.
	nullable: bool;

	/// This is the type of the decoded value if the field is dictionary encoded.
	type: Type;

	/// Present only if the field is dictionary encoded.
	dictionary: DictionaryEncoding;

	/// children apply only to nested data types like Struct, List and Union. For
	/// primitive types children will have length 0.
	children: [ Field ];

	/// User-defined metadata
	custom_metadata: [ KeyValue ];
	}

	// A dictionary batch writes this field's dictionary values as its payload.
	// If that payload is itself dictionary-encoded, readers cannot match
	// the encoded buffers to the schema metadata.

Uh oh!

Conversation

goutamadwant commented Jun 28, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

thorfour left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Jefffrey commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants