Skip to content

GROUP BY on a Dictionary column fails at runtime when distinct group count exceeds the key type's capacity #23127

Description

@Rich-T-kid

Describe the bug

When a Dictionary(K, V) column is used as a GROUP BY key, the output schema is fixed at plan time with the original key type K. If the number of distinct groups seen at runtime exceeds K's representable range, the query errors, even though the input data and the query are completely valid.

core problem

DataFusion's execution model locks the output schema at planning time. For dictionary columns this means the key type is fixed (e.g. UInt8). There is currently no mechanism for the key type to grow during execution to reflect the actual observed cardinality. The schema becomes a correctness constraint: a query over a Dict<UInt8, Utf8> column with 300 distinct values is valid SQL, but DataFusion will reject it at runtime purely because the pre-committed key type cannot represent 300 distinct indices.

To Reproduce

Where it fails today

The only current execution path for dictionary group keys is GroupValuesRows, which materializes values internally and then re-encodes back to the original dictionary type on emit. That re-encoding is where the key-type overflow surfaces as an opaque Arrow cast error. But the root cause is not GroupValuesRows specifically, any future specialised path would face the same constraint as long as the output schema's key type is fixed.

#[test]
    fn dict_uint8_utf8_257_groups_emit() -> Result<()> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "k",
            DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
            false,
        )]));

        let mut gv = GroupValuesRows::try_new(Arc::clone(&schema))?;

        let mut groups = vec![];
        for i in 0u32..257 {
            let mut builder = StringDictionaryBuilder::<UInt8Type>::new();
            builder.append_value(format!("group_{i}"));
            let arr: ArrayRef = Arc::new(builder.finish());
            gv.intern(&[arr], &mut groups)?;
        }

        assert_eq!(gv.len(), 257);

        let result = gv.emit(EmitTo::All);
        match &result {
            Ok(arrays) => {
                assert_eq!(arrays.len(), 1, "expected one output column");
                let out = &arrays[0];
                assert_eq!(
                    out.len(),
                    257,
                    "emitted array should still have 257 rows; got {}",
                    out.len()
                );
            }
            Err(e) => {
                println!("emit returned error (expected for overflow): {e}");
            }
        }

        Ok(())
    }

This produces the following error

emit returned error (expected for overflow): Arrow error: Dictionary key bigger than the key type

Expected behavior

The output key type should be treated as a minimum width, not a maximum. When the observed distinct count exceeds the planned key type's range, the engine should promote the key type to the next sufficient width.

Additional context

@kumarUjjawal mentioned a similar idea here
@tustvold encountered some issues with dictionary array schemas
@zhuqi-lucas is working on including nested types for GroupValuesColumn so this work may align.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions