Describe the bug
When a Dictionary(K, V) column is used as a GROUP BY key, the output schema is fixed at plan time with the original key type K. If the number of distinct groups seen at runtime exceeds K's representable range, the query errors, even though the input data and the query are completely valid.
Core problem
DataFusion's execution model locks the output schema at planning time. For dictionary columns this means the key type is fixed (e.g. UInt8). There is currently no mechanism for the key type to grow during execution to reflect the actual observed cardinality. The schema becomes a correctness constraint: a query over a Dict<UInt8, Utf8> column with 300 distinct values is valid SQL, but DataFusion will reject it at runtime purely because the pre-committed key type cannot represent 300 distinct indices.
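The constraint can be illustrated without any Arrow types. The following is a minimal sketch using only the Rust standard library. `encode_key_u8` is a hypothetical helper, not a DataFusion or Arrow function; it only models the fact that a `UInt8` key can address indices 0 through 255.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not part of DataFusion or Arrow): try to encode a
/// zero-based group index as a `UInt8` dictionary key.
fn encode_key_u8(group_index: usize) -> Result<u8, String> {
    // `u8::try_from` fails for any value above 255, which mirrors a
    // dictionary key type that is too narrow for the observed cardinality.
    u8::try_from(group_index)
        .map_err(|_| format!("group index {group_index} does not fit in a UInt8 key"))
}
```

With 300 distinct groups, indices 0 through 255 encode successfully and indices 256 through 299 return an error, even though nothing about the input is invalid.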
To Reproduce
Where it fails today
The only current execution path for dictionary group keys is GroupValuesRows, which materializes values internally and then re-encodes them back to the original dictionary type on emit. That re-encoding is where the key-type overflow surfaces as an opaque Arrow cast error. The root cause is not specific to GroupValuesRows, however: any future specialised path would face the same constraint as long as the output schema's key type is fixed.
#[test]
fn dict_uint8_utf8_257_groups_emit() -> Result<()> {
    // Single non-nullable group column of type Dictionary(UInt8, Utf8).
    let schema = Arc::new(Schema::new(vec![Field::new(
        "k",
        DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
        false,
    )]));
    let mut gv = GroupValuesRows::try_new(Arc::clone(&schema))?;
    let mut groups = vec![];

    // Intern 257 distinct values, one more than a UInt8 key can index.
    for i in 0u32..257 {
        let mut builder = StringDictionaryBuilder::<UInt8Type>::new();
        builder.append_value(format!("group_{i}"));
        let arr: ArrayRef = Arc::new(builder.finish());
        gv.intern(&[arr], &mut groups)?;
    }
    assert_eq!(gv.len(), 257);

    // Emitting re-encodes to the planned key type; this is where the overflow surfaces.
    let result = gv.emit(EmitTo::All);
    match &result {
        Ok(arrays) => {
            assert_eq!(arrays.len(), 1, "expected one output column");
            let out = &arrays[0];
            assert_eq!(
                out.len(),
                257,
                "emitted array should still have 257 rows; got {}",
                out.len()
            );
        }
        Err(e) => {
            println!("emit returned error (expected for overflow): {e}");
        }
    }
    Ok(())
}
This produces the following error:
emit returned error (expected for overflow): Arrow error: Dictionary key bigger than the key type
Expected behavior
The output key type should be treated as a minimum width, not a maximum. When the observed distinct count exceeds the planned key type's range, the engine should promote the key type to the next sufficient width.
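One way to picture the promotion rule is the sketch below. It is only an illustration under stated assumptions: `KeyWidth` and `promote_key_width` are hypothetical names, not existing DataFusion or Arrow APIs, and only unsigned key widths are modelled. The planned width acts as a floor, and the result is the narrowest width at or above it that can index every observed group.

```rust
/// Hypothetical model of unsigned dictionary key widths, ordered narrowest first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum KeyWidth {
    U8,
    U16,
    U32,
    U64,
}

impl KeyWidth {
    /// Number of distinct indices this key width can address.
    fn capacity(self) -> u128 {
        match self {
            KeyWidth::U8 => 1u128 << 8,
            KeyWidth::U16 => 1u128 << 16,
            KeyWidth::U32 => 1u128 << 32,
            KeyWidth::U64 => 1u128 << 64,
        }
    }
}

/// Hypothetical promotion rule: never go narrower than `planned`, and pick
/// the narrowest width that can index `distinct` groups.
fn promote_key_width(planned: KeyWidth, distinct: u64) -> KeyWidth {
    [KeyWidth::U8, KeyWidth::U16, KeyWidth::U32, KeyWidth::U64]
        .iter()
        .copied()
        // Treat the planned width as a minimum.
        .filter(|w| *w >= planned)
        // First width whose capacity covers the observed cardinality.
        .find(|w| u128::from(distinct) <= w.capacity())
        // A u64 count always fits in U64, so this fallback is never reached.
        .unwrap_or(KeyWidth::U64)
}
```

Under this rule the 257-group reproduction above would be emitted with `UInt16` keys, while a column planned as `UInt16` with only a handful of groups would keep its planned width.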
Additional context
@kumarUjjawal mentioned a similar idea here
@tustvold encountered some issues with dictionary array schemas
@zhuqi-lucas is working on including nested types for GroupValuesColumn so this work may align.