fix: serialize patches and deltas containing lone surrogates #27
Open
spokodev wants to merge 1 commit into
patch_toText (patch_obj.toString) and diff_toDelta call encodeURI, which throws "URIError: URI malformed" on a lone (unpaired) UTF-16 surrogate. Patch context-slicing (Patch_Margin, measured in UTF-16 code units) can cut between the two halves of a surrogate pair, leaving a lone surrogate in a segment even when both inputs are fully valid Unicode such as emoji. So patch_make/patch_apply work but serializing the patch throws.

Add encodeURISurrogateSafe/decodeURISurrogateSafe: identical to encodeURI/decodeURI for any string without lone surrogates, and additionally encode a lone surrogate as its WTF-8 percent byte sequence (%ED%A0-BF%80-BF), which decodeURI never produces from valid UTF-8, so the decoder cannot collide with legitimate output.

Swap the two encode sites (diff_toDelta, patch toString) and the two decode sites (diff_fromDelta, patch_fromText).

Fixes JackuB#22
Problem
patch_toText (via patch_obj.toString) and diff_toDelta call encodeURI(...), which throws URIError: URI malformed on a lone (unpaired) UTF-16 surrogate.
A lone surrogate can appear even when both inputs are fully valid Unicode. Patch context-slicing (Patch_Margin, measured in UTF-16 code units) can cut between the two halves of a surrogate pair, leaving one half inside a segment. So patch_make/patch_apply succeed, but serializing the resulting patch throws URIError: URI malformed.
Fixes #22 (reported by laurent22 / Joplin).
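The failure described in the Problem section can be reproduced without the library at all. The following is a minimal sketch, not the repro from issue #22: the `tryEncode` helper and the sample string are hypothetical, and `slice` merely stands in for the code-unit-based context cut.

```javascript
// Slicing by UTF-16 code units can land inside a surrogate pair.
// "😀" is U+1F600, stored as the two code units \uD83D \uDE00.
const text = "ab😀";
const sliced = text.slice(0, 3); // "ab\uD83D": ends with a lone high surrogate

// Hypothetical helper: returns the encoded string, or the error name on failure.
function tryEncode(s) {
  try {
    return encodeURI(s);
  } catch (e) {
    return e.name;
  }
}

console.log(tryEncode(text));   // the intact pair encodes as ordinary UTF-8
console.log(tryEncode(sliced)); // the lone half makes encodeURI throw URIError
```

Both inputs to a diff can be well-formed and still produce a segment like `sliced`, which is why the error only appears at serialization time.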
Fix
Two helpers, encodeURISurrogateSafe and decodeURISurrogateSafe:
- For any string without lone surrogates, they behave identically to encodeURI/decodeURI.
- A lone surrogate is additionally encoded as its WTF-8 percent byte sequence (%ED%A0-BF%80-BF, that is, %ED followed by one byte in the range A0-BF and one byte in the range 80-BF).
decodeURI never produces a code point in the surrogate range U+D800..U+DFFF from valid UTF-8 (it rejects those bytes as malformed), so the decoder's restore step cannot collide with any legitimate decodeURI output. Collision-safety was brute-forced over all 1.1M code points: no valid code point's encodeURI output matches the restore regex.
Four call sites swapped: the two encode sites (diff_toDelta, patch toString) and the two decode sites (diff_fromDelta, patch_fromText).
Tests
Added testLoneSurrogate (the issue #22 emoji repro round-trips through patch_fromText + patch_apply; the minimal ab😀😀 to b😀😀 serialization; a diff_toDelta lone-surrogate round-trip).
Verified red before / green after: reverting only the source change makes the new test fail with URIError: URI malformed; with the fix the full suite is green. Additionally fuzzed 5000 random patch round-trips and 5000 delta round-trips mixing emoji and lone surrogates, all passing with no throws.
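For readers who want to see the shape of the approach, here is one way the two helpers could be written. This is an illustrative sketch under stated assumptions, not the code in this PR: the function names carry a `Sketch` suffix, and `wtf8Percent` and `countCollisions` are hypothetical helpers. The collision check only covers single code points, mirroring the brute-force claim above.

```javascript
// Percent-encode one lone surrogate code unit as its 3-byte WTF-8 form.
function wtf8Percent(cu) {
  const bytes = [0xE0 | (cu >> 12), 0x80 | ((cu >> 6) & 0x3F), 0x80 | (cu & 0x3F)];
  return bytes.map((b) => "%" + b.toString(16).toUpperCase()).join("");
}

function encodeURISurrogateSafeSketch(str) {
  // The first branch consumes well-formed pairs, so only unpaired halves
  // reach the second branch.
  const re = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;
  let out = "";
  let last = 0;
  for (const m of str.matchAll(re)) {
    if (m[0].length === 2) continue; // valid pair: leave it for encodeURI
    out += encodeURI(str.slice(last, m.index)) + wtf8Percent(m[0].charCodeAt(0));
    last = m.index + 1;
  }
  return out + encodeURI(str.slice(last));
}

function decodeURISurrogateSafeSketch(str) {
  // %ED, then a byte in A0-BF, then a byte in 80-BF: never valid UTF-8.
  const re = /%ED%([AB][0-9A-F])%([89AB][0-9A-F])/gi;
  let out = "";
  let last = 0;
  for (const m of str.matchAll(re)) {
    const b2 = parseInt(m[1], 16);
    const b3 = parseInt(m[2], 16);
    const cu = 0xD000 | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    out += decodeURI(str.slice(last, m.index)) + String.fromCharCode(cu);
    last = m.index + m[0].length;
  }
  return out + decodeURI(str.slice(last));
}

// Count valid code points whose encodeURI output matches the restore pattern.
function countCollisions() {
  const restore = /%ED%[AB][0-9A-F]%[89AB][0-9A-F]/i;
  let hits = 0;
  for (let cp = 0; cp <= 0x10FFFF; cp++) {
    if (cp >= 0xD800 && cp <= 0xDFFF) continue; // skip the surrogate range itself
    if (restore.test(encodeURI(String.fromCodePoint(cp)))) hits++;
  }
  return hits;
}

// Round-trip a few strings, including lone halves and a literal "%ED%A0%BD".
for (const s of ["ab😀", "ab\uD83D", "\uDE00b😀😀", "100% sure", "%ED%A0%BD"]) {
  const encoded = encodeURISurrogateSafeSketch(s);
  console.log(JSON.stringify(encoded), decodeURISurrogateSafeSketch(encoded) === s);
}
console.log("collisions:", countCollisions());
```

A literal `%ED%A0%BD` in the input survives because encodeURI escapes each `%` as `%25`, so the restore pattern never sees it. Because `%ED` is a UTF-8 lead byte and never a continuation byte, the pattern also cannot match across the boundary between two adjacent encoded characters.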