fix: order StringType.compare by code point to match spec#508
Open
spokodev wants to merge 2 commits into
Open
Conversation
StringType inherited PrimitiveType.compare, which orders strings with the JS `<` operator (UTF-16 code unit order). The Avro spec requires strings to sort by Unicode code point, identical to the UTF-8 byte order that compareBuffers and _match already use. The two orderings diverge for supplementary plane characters, so avsc's own public comparison APIs disagreed for the same values. Add StringType.prototype.compare that walks both strings by code point, matching compareBuffers/_match. RecordType and ArrayType compare delegate to the field comparator, so they are fixed too.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
StringType.compare()andcompareBuffers()disagree for the same string values:Both are public comparison APIs, so a caller cannot rely on a single sort order.
Spec
The Avro spec Sort order defines string ordering as:
So code-point order and UTF-8 byte order are identical, and
compareBuffers/_matchalready follow it.comparedid not.Root cause
StringTypedefines nocompare, so it inheritsPrimitiveType.prototype.compare, which isutils.compare(a, b)— a plaina < b. For strings,<compares by UTF-16 code unit, not code point. The two diverge for supplementary plane (astral) characters: a surrogate lead unit (0xD800–0xDBFF) sorts before BMP code pointsU+E000–U+FFFF, whereas by code point the astral character sorts after. MeanwhileStringType._matchusestap1.matchBytes(tap2)(UTF-8), so the two APIs disagree.Fix
Add
StringType.prototype.compare, walking both strings by code point. This matches_match/compareBuffersand the spec.RecordTypeandArrayTypecomparedelegate to the field comparator, so they are corrected for string fields too.Verification
Red before, green after; added two tests in the
StringTypesuite assertingcompareagrees withcompareBuffersand follows code-point order for astral characters. Full suite passes (504 tests).