Skip to content

fix: match node:querystring on invalid-UTF-8 escapes (U+FFFD, not raw)#94

Open
spokodev wants to merge 1 commit into
anonrig:mainfrom
spokodev:fix/parse-invalid-utf8-replacement
Open

fix: match node:querystring on invalid-UTF-8 escapes (U+FFFD, not raw)#94
spokodev wants to merge 1 commit into
anonrig:mainfrom
spokodev:fix/parse-invalid-utf8-replacement

Conversation

@spokodev

Copy link
Copy Markdown

Problem

parse() returns the raw, undecoded %XX text for syntactically valid
percent-escapes that decode to invalid UTF-8. The README positions this
module as a replacement for node:querystring ("replace the legacy
node:querystring module", "Supports both parse and stringify methods
from node:querystring"), and the test suite ships a parity test against it,
so node:querystring is the oracle. On this class of input the two diverge:

const querystring = require("node:querystring");
const fqs = require("fast-querystring");

querystring.parse("a=%C3"); // { a: '�' }  (replacement character)
fqs.parse("a=%C3");         // { a: '%C3' }      (raw, undecoded)  <-- bug
input node:querystring fast-querystring (before)
a=%C3 {a:"�"} {a:"%C3"}
a=%ff {a:"�"} {a:"%ff"}
a=%E4%B8 {a:"�"} {a:"%E4%B8"}
a=%C0%80 {a:"��"} {a:"%C0%80"}

It affects keys as well as values, since both use the same decode path.
Valid multibyte (a=%E4%B8%AD -> ) and non-hex escapes (a=%zx -> %zx)
already match node:querystring; only the invalid-UTF-8 class diverges.

Cause

lib/parse.js decodes via fast-decode-uri-component, which returns null
for invalid UTF-8. The fastDecode(x) || x fallback then keeps the raw %XX
text, whereas node:querystring substitutes U+FFFD via its lenient decoder.

Fix

A small fallback decoder (lib/internals/decode.js) reproduces what
node:querystring does on these tokens: when at least one decodable %XX
escape is present, each code unit is masked to a single byte and run through a
non-fatal UTF-8 decode (so invalid sequences become U+FFFD); when only
malformed escapes are present (e.g. %zz), the token is returned unchanged,
matching node:querystring.

The fallback runs only when fastDecode returns null (the rare invalid
case). On valid input fastDecode returns a string and the new code path is
never entered, so the fast path is unchanged.

Verification

  • Full suite green, plus added parity tests asserting parse() equals
    node:querystring.parse() for the invalid-UTF-8 class in keys and values.
  • Differential of parse() vs node:querystring.parse() over 500k random
    query strings: every previously-passing case still matches, and the entire
    invalid-UTF-8 class now matches (0 cases that this change makes worse).
  • Over 1,000,000 ASCII-input query strings (the domain real query strings come
    from, since non-ASCII is always percent-encoded) parse() now matches
    node:querystring.parse() exactly.
  • Verified on Node 24 and Node 26.

parse() returned the raw, undecoded "%XX" text for syntactically valid
percent-escapes that decode to invalid UTF-8 (e.g. parse("a=%C3") gave
{a:"%C3"}). node:querystring, the module README positions this as a
replacement for, substitutes U+FFFD instead ({a:"�"}). This affected
both keys and values.

fast-decode-uri-component returns null on invalid UTF-8, and the
`fastDecode(x) || x` fallback kept the raw escape. Add a lenient fallback
decoder that reproduces node:querystring exactly: when a decodable %XX
escape is present, mask each code unit to one byte and run it through a
non-fatal UTF-8 decode (yielding U+FFFD on bad sequences); when only
malformed escapes are present, return the token unchanged.

The fallback runs only when fastDecode returns null, so the valid-input
fast path is unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant