Skip to content

fix(ipc): reject dictionary-encoded dictionary values#10230

Merged
Jefffrey merged 2 commits into
apache:mainfrom
goutamadwant:fix-ipc-dictionary-values
Jul 2, 2026
Merged

fix(ipc): reject dictionary-encoded dictionary values#10230
Jefffrey merged 2 commits into
apache:mainfrom
goutamadwant:fix-ipc-dictionary-values

Conversation

@goutamadwant

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Arrow IPC dictionary batches write a dictionary field's values as the batch payload. When those values are themselves dictionary-encoded, the writer can produce IPC data that readers cannot decode, failing later with a buffer metadata mismatch.

What changes are included in this PR?

This adds schema validation to the IPC file and stream writer constructors so direct dictionary-of-dictionary schemas return a clear InvalidArgumentError before any IPC bytes are written. A low-level dictionary encoding guard is kept as a backstop.

A regression test covers the direct Dictionary(_, Dictionary(_, _)) case for both IPC stream and file writers.

Are these changes tested?

Yes.

  • cargo fmt --all -- --check
  • cargo test -p arrow-ipc
  • cargo test -p arrow-ipc --all-features
  • cargo clippy -p arrow-ipc --all-targets -- -D warnings
  • cargo clippy -p arrow-ipc --all-targets --all-features -- -D warnings

Are there any user-facing changes?

Yes. The IPC file and stream writers now return a clear error for direct dictionary-of-dictionary schemas instead of writing IPC data that fails during read. There are no public API changes.

@thorfour thorfour left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! This will be a helpful guard

Comment thread arrow-ipc/src/writer.rs Outdated
// the encoded buffers to the schema metadata.
if matches!(value_type.as_ref(), DataType::Dictionary(_, _)) {
return Err(ArrowError::InvalidArgumentError(format!(
"IPC does not support dictionary-encoded values for dictionary field {:?}",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not overly familiar with the IPC spec but if it could support dict of dict encoding, can we update this error message to point to an issue about implementing it in the future?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as far as I can tell, there is nothing that prevents dicts from dicts on IPC, and some forms of nested dicts are already supported, at least when they're not direct children (eg. dict(struct(dict))).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at Schema.fbs, I think it actually is not capable of encoding a pure dict(dict)

arrow-rs/format/Schema.fbs

Lines 511 to 530 in 9b190c9

table Field {
/// Name is not required, in i.e. a List
name: string;
/// Whether or not this field can contain nulls. Should be true in general.
nullable: bool;
/// This is the type of the decoded value if the field is dictionary encoded.
type: Type;
/// Present only if the field is dictionary encoded.
dictionary: DictionaryEncoding;
/// children apply only to nested data types like Struct, List and Union. For
/// primitive types children will have length 0.
children: [ Field ];
/// User-defined metadata
custom_metadata: [ KeyValue ];
}

If we have Dictionary(i32, Dictionary(u32, Utf8)), then type here can't really be anything other than Utf8 since Type excludes a dictionary variant; and dictionary here doesn't seem to support nesting. So it seems physically incapable of encoding dict(dict(utf8)) and would always flatten to dict(utf8).

I tested on pyarrow 24.0.0 and it looks to be incapable of this roundtrip too:

import pyarrow as pa

print(pa.__version__)

inner = pa.array(["a", "b", "c"], pa.dictionary(pa.int64(), pa.utf8()))
outer = pa.array(inner, pa.dictionary(pa.int64(), inner.type))
batch = pa.record_batch([outer], names=["c0"])
print(batch.schema)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    for i in range(5):
        writer.write_batch(batch)
buf = sink.getvalue()
with pa.ipc.open_stream(buf) as reader:
    print(reader.schema)
    batches = [b for b in reader]
print("Done")
~$ uv run --with pyarrow python3 -m nested_dict.py
24.0.0
c0: dictionary<values=dictionary<values=string, indices=int64, ordered=0>, indices=int64, ordered=0>
c0: dictionary<values=string, indices=int64, ordered=0>
Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/Users/jeffrey/nested_dict.py", line 16, in <module>
    batches = [b for b in reader]
                          ^^^^^^
  File "pyarrow/ipc.pxi", line 757, in pyarrow.lib.RecordBatchReader.__next__
  File "pyarrow/ipc.pxi", line 791, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: buffer_index out of range.
  • See how second print (for reader schema) shows its flattened from original schema

We could send a query to the arrow mailing list for clarification.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Jefffrey makes sense. I updated the error message to frame this as an Arrow IPC field metadata limitation for direct dictionary-of-dictionary values.

The guard is still limited to direct dictionary-of-dictionary schemas, so existing indirect nested dictionary cases such as dict(struct(dict)) remain supported. I also removed the older buffer-matching explanation from the code. Let me know if any suggestions.

Comment thread arrow-ipc/src/reader.rs Outdated
Comment on lines +2716 to +2719
let err = match crate::writer::StreamWriter::try_new(&mut stream, batch.schema_ref()) {
Ok(_) => panic!("IPC stream writer should reject dictionary-of-dictionary schemas"),
Err(err) => err,
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let err = match crate::writer::StreamWriter::try_new(&mut stream, batch.schema_ref()) {
Ok(_) => panic!("IPC stream writer should reject dictionary-of-dictionary schemas"),
Err(err) => err,
};
let err = StreamWriter::try_new(&mut stream, batch.schema_ref()).unwrap_err();

same below

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted this to use let Err(err) = ... else { panic!(...) }. unwrap_err would require the successful writer type to implement Debug.

Comment thread arrow-ipc/src/writer.rs Outdated
Comment on lines +540 to +542
// A dictionary batch writes this field's dictionary values as its payload.
// If that payload is itself dictionary-encoded, readers cannot match
// the encoded buffers to the schema metadata.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A dictionary batch writes this field's dictionary values as its payload.
// If that payload is itself dictionary-encoded, readers cannot match
// the encoded buffers to the schema metadata.

We don't really need this detail; the underlying issue seems to be IPC format itself doesn't really support this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. removed the detailed comment and kept the explanation in the error message.

Comment thread arrow-ipc/src/writer.rs Outdated
.try_for_each(|field| ensure_supported_ipc_field(field))
}

fn ensure_supported_ipc_field(field: &Field) -> Result<(), ArrowError> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could inline this function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, inlined the helper.

@goutamadwant goutamadwant force-pushed the fix-ipc-dictionary-values branch from 0c6ab92 to 0b6fd36 Compare June 30, 2026 03:36
@Jefffrey Jefffrey merged commit d3d7ae8 into apache:main Jul 2, 2026
30 checks passed
@Jefffrey

Jefffrey commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

thanks @goutamadwant, @brancz & @thorfour

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Buffer count mismatched with metadata when encoding records with dictionary of dictionaries

4 participants