fix(tokenizer): treat <![CDATA[ as a bogus comment in HTML mode#2458
fix(tokenizer): treat <![CDATA[ as a bogus comment in HTML mode#2458spokodev wants to merge 1 commit into
<![CDATA[ as a bogus comment in HTML mode#2458Conversation
In HTML mode `<![CDATA[` outside foreign content must be tokenized as a bogus comment that ends at the first `>` (WHATWG HTML §13.2.5.42 Markup declaration open state, §13.2.5.43 Bogus comment state). The tokenizer instead entered the comment-like CDATA path on every full `CDATA[` match, so the section only ended at `]]>` (or EOF) and swallowed all following markup into one inert comment. `stateCDATASequence` now only takes the real-CDATA path when CDATA should be recognized (`xmlMode`, `recognizeCDATA`, or foreign content), mirroring the existing partial-match branch which already routes to a bogus comment in HTML mode. xmlMode, foreign-content, and `recognizeCDATA` behavior is unchanged. Before: `<![CDATA[x><img>` -> comment `[CDATA[x><img>` (img hidden). After: `<![CDATA[x><img>` -> comment `[CDATA[x` + live `<img>`, matching parse5 and browsers.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds a ChangesCDATA Recognition
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
In default HTML mode,
<![CDATA[outside foreign content is tokenized with the wrong end-condition. On a fullCDATA[match,stateCDATASequencealways enters the comment-like CDATA path (InCommentLikewith the]]>end sequence), so the section only ends at]]>(or EOF). Per the WHATWG HTML spec,<![CDATA[in this position is a bogus comment that ends at the first>, leaving everything after>as live markup.Because of this, htmlparser2 collapses a CDATA-prefixed string plus all following markup into a single inert comment node, instead of emitting a short comment followed by the live elements.
Spec
<![CDATA[is only consumed as a CDATA section when there is an adjusted current node that is not an element in the HTML namespace (foreign content); otherwise it is a parse error and the tokenizer switches to the bogus comment state.>.Reproduction
Current
master:parse5 (used here as a differential spec oracle) and every browser:
The trailing
onerrorattribute is only there to make the divergence visible: in the buggy output the markup after>is hidden inside a comment; in the spec output it is a realimgelement. This is a parser-correctness issue, not an exploit report.Prior acknowledgement
This is the same behavior reported in #119 ("CDATA parsing is not correct when it is not in xmlMode"). The reporter verified against Gumbo and parse5 that
<![CDATA[ > <div> xxx </div>]]>should yield a bogus comment<![CDATA[ >followed by live<div>xxx</div>markup, and the maintainer confirmed: "Ahh, right. I forgot it behaved this way. This will be fixed with #114." #114 (the high5 tokenizer rewrite) never landed, so the bug is still present onmaster.It also runs counter to the recent spec-alignment work in #2382, #2383, and #2387.
Fix
stateCDATASequencenow only takes the real-CDATA path (InCommentLike+]]>) when CDATA should actually be recognized, via a newshouldRecognizeCDATA()helper that checksxmlMode || recognizeCDATA || isInForeignContext(). In default HTML mode it routes toInSpecialComment(bogus comment ending at>), exactly like the existing partial-match branch a few lines below already does.recognizeCDATAis read from the options already passed to the tokenizer.xmlMode, foreign-content (SVG/MathML), andrecognizeCDATAreal-CDATA behavior is preserved.Tests
Added to
src/index.spec.ts:<![CDATA[x><img src=x>in HTML mode yields a comment ending at>plus a liveimgelement (mirrors the parse5 tree above; fails onmaster, passes with this change);xmlModeis still recognized.Full suite green:
No snapshot files changed, so existing CDATA tokenizer cases (
<![CDATA[ foo ]]>,<![CD>, unclosed-at-EOF,recognizeCDATAedge cases) keep their current behavior.Summary by CodeRabbit
New Features
<![CDATA[so it behaves correctly in XML mode and foreign content.Bug Fixes
<![CDATA[is now treated as a bogus comment outside foreign content, preventing unintended parsing of following markup.