Skip to content

Optimize dictionary writers by replacing fastutil Linked maps with OpenHashMap + ArrayList #3513

@iemejia

Description

@iemejia

Background

The dictionary writers in DictionaryValuesWriter (binary, long, double, float, int variants) use fastutil's *2IntLinkedOpenHashMap to assign dictionary indices to distinct values. The "Linked" variant is used because the dictionary page must be emitted in insertion order (dictionary index i corresponds to the i-th distinct value seen), and the linked map preserves that order via a doubly-linked list threaded through the slot array.

This is correct, but the linked variant pays an avoidable cost on every insertion:

  • Two extra long fields per slot (prev, next indices) → larger slots, more cache lines per probe
  • Pointer fix-up on insert: update the previous tail's next, the new entry's prev, and the lastEntry pointer (3–4 scattered writes per insert)
  • Re-stitching the linked list on rehash
  • Pure pointer chasing — not vectorizable, not branch-friendly

For high-cardinality columns (hundreds of thousands of distinct values per column chunk), this overhead compounds on a hot path.

Proposal

Replace each *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap plus a separate primitive-typed list (IntArrayList / LongArrayList / FloatArrayList / DoubleArrayList / ArrayList<Binary>) to track insertion order:

  • The hash map becomes a pure "have I seen this? what's its id?" lookup with the smallest possible slot.
  • The list is append-only, contiguous, and cache-friendly to iterate at flush time.

The two responsibilities (lookup vs ordering) were jammed into one structure; splitting them lets each be optimal. The trade-off is one extra primitive entry per distinct value, which is small compared to the per-slot overhead the linked map was paying.

Note: both candidates are fastutil primitive-keyed maps, so this is not a boxing change. The win is structural — eliminating an ordering guarantee that was being paid for on every insert when an explicit append-only list provides it more cheaply.

Expected impact

From local benchmark runs (BinaryEncodingBenchmark.encodeDictionary, IntEncodingBenchmark.encodeDictionary — being added in #3512):

  • BinaryEncodingBenchmark.encodeDictionary (high cardinality, short strings): +23% to +42% throughput
  • IntEncodingBenchmark.encodeDictionary (high cardinality): ~+2x in some configurations
  • Low-cardinality cases are flat or trivially different (the linked-list overhead doesn't matter when there are few inserts)

Files affected

  • parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java — the 5 inner classes PlainBinaryDictionaryValuesWriter, PlainLongDictionaryValuesWriter, PlainDoubleDictionaryValuesWriter, PlainFloatDictionaryValuesWriter, PlainIntegerDictionaryValuesWriter.

No public API change. No file format change. Behavior is identical (dictionary pages emit values in the same order).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions