fix: serialize patches and deltas containing lone surrogates #27

Open
spokodev wants to merge 1 commit into JackuB:master from spokodev:fix/encode-lone-surrogate

@spokodev

Problem

patch_toText (via patch_obj.toString) and diff_toDelta call encodeURI(...), which throws URIError: URI malformed on a lone (unpaired) UTF-16 surrogate.

A lone surrogate can appear even when both inputs are fully valid Unicode. Patch context slicing (Patch_Margin, measured in UTF-16 code units) can split a surrogate pair, leaving one half inside a segment. So patch_make / patch_apply succeed, but serializing the resulting patch throws:

const dmp = new diff_match_patch();
dmp.patch_toText(dmp.patch_make('ab😀😀', 'b😀😀'));
// URIError: URI malformed
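
The underlying failure can be reproduced without the library. The following is a minimal sketch using only built-in JavaScript; the helper name tryEncode is hypothetical and exists only for this illustration. It shows that slicing an emoji by UTF-16 code unit leaves a lone high surrogate, and that encodeURI rejects it:

```javascript
// U+1F600 is stored as two UTF-16 code units: 0xD83D 0xDE00.
const emoji = '😀';

// Slicing by code unit (as a margin measured in code units can do)
// keeps only the high surrogate.
const half = emoji.slice(0, 1);

// Hypothetical helper: returns the encoded text, or the error name on failure.
function tryEncode(text) {
  try {
    return encodeURI(text);
  } catch (err) {
    return err.name;
  }
}

console.log(emoji.length);                    // 2
console.log(half.charCodeAt(0).toString(16)); // d83d
console.log(tryEncode(emoji));                // %F0%9F%98%80
console.log(tryEncode(half));                 // URIError
```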

Fixes #22 (reported by laurent22 / Joplin).

Fix

Two helpers, encodeURISurrogateSafe and decodeURISurrogateSafe:

  • For any string without lone surrogates they are byte-for-byte identical to encodeURI / decodeURI.
  • A lone surrogate is additionally encoded as its WTF-8 percent-encoded byte sequence: %ED, then a byte in %A0..%BF, then a byte in %80..%BF. decodeURI never produces a code point in the surrogate range U+D800..U+DFFF from valid UTF-8 (it rejects those bytes as malformed), and encodeURI never emits that byte pattern for well-formed input, so the decoder's restore step cannot collide with any legitimate output. Collision safety was brute-forced over all 1.1M code points: no valid code point's encodeURI output matches the restore regex.
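
The approach described in the two bullets above might look roughly like the sketch below. This is not the code from the PR's diff, which is not shown on this page; only the two function names come from the description, while the helper wtf8Percent, the constant LONE_SURROGATE_BYTES, and the loop structure are assumptions made for illustration.

```javascript
// Assumed helper: percent-encode one lone surrogate as its 3-byte WTF-8 form.
function wtf8Percent(code) {
  const bytes = [
    0xE0 | (code >> 12),          // always 0xED for U+D800..U+DFFF
    0x80 | ((code >> 6) & 0x3F),  // 0xA0..0xBF
    0x80 | (code & 0x3F),         // 0x80..0xBF
  ];
  return bytes.map((b) => '%' + b.toString(16).toUpperCase()).join('');
}

function encodeURISurrogateSafe(text) {
  let out = '';
  let run = ''; // pending well-formed text, handed to encodeURI unchanged
  // String iteration yields a surrogate pair as one 2-unit string and a
  // lone surrogate as a 1-unit string.
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    if (ch.length === 1 && code >= 0xD800 && code <= 0xDFFF) {
      out += encodeURI(run) + wtf8Percent(code);
      run = '';
    } else {
      run += ch;
    }
  }
  return out + encodeURI(run);
}

// Assumed restore pattern: %ED, then %A0..%BF, then %80..%BF.
const LONE_SURROGATE_BYTES = /%ED%([AB][0-9A-F])%([89AB][0-9A-F])/g;

function decodeURISurrogateSafe(encoded) {
  let out = '';
  let last = 0;
  for (const m of encoded.matchAll(LONE_SURROGATE_BYTES)) {
    // Everything between matches is ordinary decodeURI input.
    out += decodeURI(encoded.slice(last, m.index));
    const b2 = parseInt(m[1], 16);
    const b3 = parseInt(m[2], 16);
    out += String.fromCharCode(0xD000 | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
    last = m.index + m[0].length;
  }
  return out + decodeURI(encoded.slice(last));
}

const sample = 'a\uD83Db😀';
console.log(encodeURISurrogateSafe('\uD83D'));  // %ED%A0%BD
console.log(decodeURISurrogateSafe(encodeURISurrogateSafe(sample)) === sample); // true
```

Splitting the input at lone surrogates and passing the remaining runs to encodeURI is one way to keep the output identical to encodeURI for well-formed strings; a literal "%" in the input is still escaped as %25, so text that merely looks like the restore pattern is not mistaken for a surrogate.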

Four call sites swapped: the two encode sites (diff_toDelta, patch toString) and the two decode sites (diff_fromDelta, patch_fromText).

Tests

Added testLoneSurrogate, which covers three cases: the issue #22 emoji repro round-tripping through patch_fromText + patch_apply; serialization of the minimal ab😀😀 to b😀😀 patch; and a diff_toDelta lone-surrogate round-trip.

Verified red before / green after: reverting only the source change makes the new test fail with URIError: URI malformed; with the fix the full suite is green. Additionally fuzzed 5000 random patch round-trips and 5000 delta round-trips mixing emoji and lone surrogates, all passing with no throws.

patch_toText (patch_obj.toString) and diff_toDelta call encodeURI, which
throws "URIError: URI malformed" on a lone (unpaired) UTF-16 surrogate.
Patch context-slicing (Patch_Margin, measured in UTF-16 code units) can cut
between a surrogate pair, leaving a lone surrogate in a segment even when both
inputs are fully valid Unicode such as emoji. So patch_make/patch_apply work
but serializing the patch throws.

Add encodeURISurrogateSafe/decodeURISurrogateSafe: identical to
encodeURI/decodeURI for any string without lone surrogates, and additionally
encode a lone surrogate as its WTF-8 percent byte sequence (%ED%A0-BF%80-BF),
which decodeURI never produces from valid UTF-8 so the decoder cannot collide
with legitimate output. Swap the two encode sites (diff_toDelta, patch
toString) and the two decode sites (diff_fromDelta, patch_fromText).

Fixes JackuB#22
Development

Successfully merging this pull request may close these issues.

Library throws "URI malformed" error when creating patch with emojis

1 participant