fix: order StringType.compare by code point to match spec #508

Open
spokodev wants to merge 2 commits into mtth:main from spokodev:fix/string-compare-codepoint-order

@spokodev

Problem

StringType.compare() and compareBuffers() disagree for the same string values:

const type = avsc.Type.forSchema('string');
const a = '', b = '\u{1F600}'; // U+E000 vs U+1F600

type.compare(a, b);                                       // +1  (wrong)
type.compareBuffers(type.toBuffer(a), type.toBuffer(b));  // -1  (correct)

Both are public comparison APIs, so a caller cannot rely on a single sort order.

Spec

The Avro spec Sort order defines string ordering as:

strings are compared lexicographically by Unicode code point. Note that since UTF-8 is used as the encoding for strings, sorting the bytes (which is what compareBuffers does) gives the same result as comparing the code points.

So code-point order and UTF-8 byte order are identical, and compareBuffers/_match already follow it. compare did not.
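The equivalence between UTF-8 byte order and code-point order can be checked directly in Node.js. The snippet below is a minimal sketch, not code from avsc: `compareUtf8Bytes` and `compareFirstCodePoint` are hypothetical helper names, and the sample strings are single characters chosen to span the ASCII, BMP, and astral ranges.

```javascript
// Hypothetical helper: compare two strings by their UTF-8 encoded bytes.
// Buffer is a Node.js global; Buffer.compare returns -1, 0, or 1.
function compareUtf8Bytes(a, b) {
  return Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'));
}

// Hypothetical helper: compare two single-character strings by code point.
function compareFirstCodePoint(a, b) {
  return Math.sign(a.codePointAt(0) - b.codePointAt(0));
}

// One character each from ASCII, Latin-1, the upper BMP, and an astral plane.
const samples = ['A', '\u00E9', '\uE000', '\uFFFF', '\u{1F600}'];

// Collect every pair where the two orderings disagree.
const mismatches = [];
for (const a of samples) {
  for (const b of samples) {
    if (compareUtf8Bytes(a, b) !== compareFirstCodePoint(a, b)) {
      mismatches.push([a, b]);
    }
  }
}

console.log(mismatches.length); // 0
```

This only samples a handful of well-formed characters. Strings containing lone surrogates are not covered, since they have no valid UTF-8 encoding.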

Root cause

StringType defines no compare of its own, so it inherits PrimitiveType.prototype.compare, which is utils.compare(a, b), a plain a < b comparison. For strings, < compares by UTF-16 code unit, not by code point. The two orderings diverge for supplementary-plane (astral) characters: a lead surrogate (0xD800–0xDBFF) sorts before the BMP code points U+E000–U+FFFF, whereas by code point the astral character sorts after them. Meanwhile StringType._match uses tap1.matchBytes(tap2), which compares UTF-8 bytes, so the two APIs disagree.
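The divergence is visible in plain JavaScript without avsc. This small illustration uses the two values from the Problem section; the helper `codeUnits` and the variable names are made up for the example.

```javascript
const bmp = '\uE000';        // one UTF-16 code unit: 0xE000
const astral = '\u{1F600}';  // surrogate pair: 0xD83D 0xDE00

// Hypothetical helper: list the UTF-16 code units of a string as hex.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
  }
  return units;
}

console.log(codeUnits(bmp));    // [ 'e000' ]
console.log(codeUnits(astral)); // [ 'd83d', 'de00' ]

// The `<` operator compares code units: 0xE000 > 0xD83D, so bmp sorts after.
const byCodeUnit = bmp < astral ? -1 : bmp > astral ? 1 : 0;

// By code point, 0xE000 < 0x1F600, so bmp sorts before.
const byCodePoint = Math.sign(bmp.codePointAt(0) - astral.codePointAt(0));

console.log(byCodeUnit, byCodePoint); // 1 -1
```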

Fix

Add StringType.prototype.compare, walking both strings by code point. This matches _match/compareBuffers and the spec. RecordType and ArrayType compare delegate to the field comparator, so they are corrected for string fields too.
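For illustration, a code-point walk might look like the sketch below. This is not the code in this PR: `compareByCodePoint` is a hypothetical standalone function, and it assumes well-formed strings (a lone surrogate is compared as its raw code unit value, which may not match what a UTF-8 encoder would produce for it).

```javascript
// Hypothetical sketch: compare two strings lexicographically by code point.
// Returns -1, 0, or 1, like the other compare functions discussed here.
function compareByCodePoint(a, b) {
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    const ca = a.codePointAt(i);
    const cb = b.codePointAt(j);
    if (ca !== cb) {
      return ca < cb ? -1 : 1;
    }
    // Astral code points occupy two UTF-16 code units, so skip both.
    i += ca > 0xffff ? 2 : 1;
    j += cb > 0xffff ? 2 : 1;
  }
  // All shared code points were equal: the shorter string sorts first.
  if (i < a.length) return 1;
  if (j < b.length) return -1;
  return 0;
}

// The default sort uses UTF-16 code unit order; the comparator uses code points.
const values = ['\u{1F600}', '\uE000', 'a'];
const codeUnitOrder = [...values].sort();
const codePointOrder = [...values].sort(compareByCodePoint);

console.log(codeUnitOrder[1] === '\u{1F600}');  // true: astral sorts before U+E000
console.log(codePointOrder[2] === '\u{1F600}'); // true: astral sorts last
```

Tracking two indices keeps the loop correct even though both advance by the same amount whenever the code points match. A single index would work equally well here.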

Verification

Red before, green after; added two tests in the StringType suite asserting compare agrees with compareBuffers and follows code-point order for astral characters. Full suite passes (504 tests).

webmaster128 and others added 2 commits April 9, 2025 21:13
StringType inherited PrimitiveType.compare, which orders strings with
the JS `<` operator (UTF-16 code unit order). The Avro spec requires
strings to sort by Unicode code point, identical to the UTF-8 byte
order that compareBuffers and _match already use. The two orderings
diverge for supplementary plane characters, so avsc's own public
comparison APIs disagreed for the same values.

Add StringType.prototype.compare that walks both strings by code point,
matching compareBuffers/_match. RecordType and ArrayType compare
delegate to the field comparator, so they are fixed too.