fix(ipc): reject dictionary-encoded dictionary values#10230
Conversation
thorfour
left a comment
There was a problem hiding this comment.
Thanks for this! This will be a helpful guard
| // the encoded buffers to the schema metadata. | ||
| if matches!(value_type.as_ref(), DataType::Dictionary(_, _)) { | ||
| return Err(ArrowError::InvalidArgumentError(format!( | ||
| "IPC does not support dictionary-encoded values for dictionary field {:?}", |
There was a problem hiding this comment.
I'm not overly familiar with the IPC spec but if it could support dict of dict encoding, can we update this error message to point to an issue about implementing it in the future?
There was a problem hiding this comment.
Yeah, as far as I can tell, there is nothing that prevents dicts from dicts on IPC, and some forms of nested dicts are already supported, at least when they're not direct children (eg. dict(struct(dict))).
There was a problem hiding this comment.
Looking at Schema.fbs, I think it actually is not capable of encoding a pure dict(dict)
Lines 511 to 530 in 9b190c9
If we have Dictionary(i32, Dictionary(u32, Utf8)), then type here can't really be anything other than Utf8 since Type excludes a dictionary variant; and dictionary here doesn't seem to support nesting. So it seems physically incapable of encoding dict(dict(utf8)) and would always flatten to dict(utf8).
I tested on pyarrow 24.0.0 and it looks to be incapable of this roundtrip too:
import pyarrow as pa
print(pa.__version__)
inner = pa.array(["a", "b", "c"], pa.dictionary(pa.int64(), pa.utf8()))
outer = pa.array(inner, pa.dictionary(pa.int64(), inner.type))
batch = pa.record_batch([outer], names=["c0"])
print(batch.schema)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
for i in range(5):
writer.write_batch(batch)
buf = sink.getvalue()
with pa.ipc.open_stream(buf) as reader:
print(reader.schema)
batches = [b for b in reader]
print("Done")~$ uv run --with pyarrow python3 -m nested_dict.py
24.0.0
c0: dictionary<values=dictionary<values=string, indices=int64, ordered=0>, indices=int64, ordered=0>
c0: dictionary<values=string, indices=int64, ordered=0>
Traceback (most recent call last):
File "<frozen runpy>", line 189, in _run_module_as_main
File "<frozen runpy>", line 112, in _get_module_details
File "/Users/jeffrey/nested_dict.py", line 16, in <module>
batches = [b for b in reader]
^^^^^^
File "pyarrow/ipc.pxi", line 757, in pyarrow.lib.RecordBatchReader.__next__
File "pyarrow/ipc.pxi", line 791, in pyarrow.lib.RecordBatchReader.read_next_batch
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: buffer_index out of range.- See how second print (for reader schema) shows its flattened from original schema
We could send a query to the arrow mailing list for clarification.
There was a problem hiding this comment.
Thanks @Jefffrey makes sense. I updated the error message to frame this as an Arrow IPC field metadata limitation for direct dictionary-of-dictionary values.
The guard is still limited to direct dictionary-of-dictionary schemas, so existing indirect nested dictionary cases such as dict(struct(dict)) remain supported. I also removed the older buffer-matching explanation from the code. Let me know if any suggestions.
| let err = match crate::writer::StreamWriter::try_new(&mut stream, batch.schema_ref()) { | ||
| Ok(_) => panic!("IPC stream writer should reject dictionary-of-dictionary schemas"), | ||
| Err(err) => err, | ||
| }; |
There was a problem hiding this comment.
| let err = match crate::writer::StreamWriter::try_new(&mut stream, batch.schema_ref()) { | |
| Ok(_) => panic!("IPC stream writer should reject dictionary-of-dictionary schemas"), | |
| Err(err) => err, | |
| }; | |
| let err = StreamWriter::try_new(&mut stream, batch.schema_ref()).unwrap_err(); |
same below
There was a problem hiding this comment.
adjusted this to use let Err(err) = ... else { panic!(...) }. unwrap_err would require the successful writer type to implement Debug.
| // A dictionary batch writes this field's dictionary values as its payload. | ||
| // If that payload is itself dictionary-encoded, readers cannot match | ||
| // the encoded buffers to the schema metadata. |
There was a problem hiding this comment.
| // A dictionary batch writes this field's dictionary values as its payload. | |
| // If that payload is itself dictionary-encoded, readers cannot match | |
| // the encoded buffers to the schema metadata. |
We don't really need this detail; the underlying issue seems to be IPC format itself doesn't really support this
There was a problem hiding this comment.
Done. removed the detailed comment and kept the explanation in the error message.
| .try_for_each(|field| ensure_supported_ipc_field(field)) | ||
| } | ||
|
|
||
| fn ensure_supported_ipc_field(field: &Field) -> Result<(), ArrowError> { |
There was a problem hiding this comment.
nit: could inline this function
There was a problem hiding this comment.
Done, inlined the helper.
0c6ab92 to
0b6fd36
Compare
|
thanks @goutamadwant, @brancz & @thorfour |
Which issue does this PR close?
Rationale for this change
Arrow IPC dictionary batches write a dictionary field's values as the batch payload. When those values are themselves dictionary-encoded, the writer can produce IPC data that readers cannot decode, failing later with a buffer metadata mismatch.
What changes are included in this PR?
This adds schema validation to the IPC file and stream writer constructors so direct dictionary-of-dictionary schemas return a clear
InvalidArgumentErrorbefore any IPC bytes are written. A low-level dictionary encoding guard is kept as a backstop.A regression test covers the direct
Dictionary(_, Dictionary(_, _))case for both IPC stream and file writers.Are these changes tested?
Yes.
cargo fmt --all -- --checkcargo test -p arrow-ipccargo test -p arrow-ipc --all-featurescargo clippy -p arrow-ipc --all-targets -- -D warningscargo clippy -p arrow-ipc --all-targets --all-features -- -D warningsAre there any user-facing changes?
Yes. The IPC file and stream writers now return a clear error for direct dictionary-of-dictionary schemas instead of writing IPC data that fails during read. There are no public API changes.