Skip to content

fix(tokenizer): treat <![CDATA[ as a bogus comment in HTML mode#2458

Open
spokodev wants to merge 1 commit into
fb55:masterfrom
spokodev:fix/html-mode-cdata-bogus-comment
Open

fix(tokenizer): treat <![CDATA[ as a bogus comment in HTML mode#2458
spokodev wants to merge 1 commit into
fb55:masterfrom
spokodev:fix/html-mode-cdata-bogus-comment

Conversation

@spokodev

@spokodev spokodev commented Jun 30, 2026

Copy link
Copy Markdown

Summary

In default HTML mode, <![CDATA[ outside foreign content is tokenized with the wrong end-condition. On a full CDATA[ match, stateCDATASequence always enters the comment-like CDATA path (InCommentLike with the ]]> end sequence), so the section only ends at ]]> (or EOF). Per the WHATWG HTML spec, <![CDATA[ in this position is a bogus comment that ends at the first >, leaving everything after > as live markup.

Because of this, htmlparser2 collapses a CDATA-prefixed string plus all following markup into a single inert comment node, instead of emitting a short comment followed by the live elements.

Spec

Reproduction

import * as htmlparser2 from "htmlparser2";
const dom = htmlparser2.parseDocument("<![CDATA[x><img src=x onerror=alert(1)>");

Current master:

COMMENT "[CDATA[x><img src=x onerror=alert(1)>"   // single node, <img> never parsed

parse5 (used here as a differential spec oracle) and every browser:

COMMENT "[CDATA[x"
ELEMENT <img src=x onerror=alert(1)>              // live element

The trailing onerror attribute is only there to make the divergence visible: in the buggy output the markup after > is hidden inside a comment; in the spec output it is a real img element. This is a parser-correctness issue, not an exploit report.

Prior acknowledgement

This is the same behavior reported in #119 ("CDATA parsing is not correct when it is not in xmlMode"). The reporter verified against Gumbo and parse5 that <![CDATA[ > <div> xxx </div>]]> should yield a bogus comment <![CDATA[ > followed by live <div>xxx</div> markup, and the maintainer confirmed: "Ahh, right. I forgot it behaved this way. This will be fixed with #114." #114 (the high5 tokenizer rewrite) never landed, so the bug is still present on master.

It also runs counter to the recent spec-alignment work in #2382, #2383, and #2387.

Fix

stateCDATASequence now only takes the real-CDATA path (InCommentLike + ]]>) when CDATA should actually be recognized, via a new shouldRecognizeCDATA() helper that checks xmlMode || recognizeCDATA || isInForeignContext(). In default HTML mode it routes to InSpecialComment (bogus comment ending at >), exactly like the existing partial-match branch a few lines below already does. recognizeCDATA is read from the options already passed to the tokenizer.

xmlMode, foreign-content (SVG/MathML), and recognizeCDATA real-CDATA behavior is preserved.

Tests

Added to src/index.spec.ts:

  • a red-baseline assertion that <![CDATA[x><img src=x> in HTML mode yields a comment ending at > plus a live img element (mirrors the parse5 tree above; fails on master, passes with this change);
  • two regression guards confirming real CDATA in foreign content and in xmlMode is still recognized.

Full suite green:

Test Files  7 passed (7)
     Tests  190 passed (190)

No snapshot files changed, so existing CDATA tokenizer cases (<![CDATA[ foo ]]>, <![CD>, unclosed-at-EOF, recognizeCDATA edge cases) keep their current behavior.

Summary by CodeRabbit

  • New Features

    • Added support for optionally recognizing CDATA sections during parsing.
    • Improved handling of <![CDATA[ so it behaves correctly in XML mode and foreign content.
  • Bug Fixes

    • In HTML mode, <![CDATA[ is now treated as a bogus comment outside foreign content, preventing unintended parsing of following markup.
    • Preserved CDATA content more reliably in supported parsing contexts.

In HTML mode `<![CDATA[` outside foreign content must be tokenized as a
bogus comment that ends at the first `>` (WHATWG HTML §13.2.5.42 Markup
declaration open state, §13.2.5.43 Bogus comment state). The tokenizer
instead entered the comment-like CDATA path on every full `CDATA[`
match, so the section only ended at `]]>` (or EOF) and swallowed all
following markup into one inert comment.

`stateCDATASequence` now only takes the real-CDATA path when CDATA
should be recognized (`xmlMode`, `recognizeCDATA`, or foreign content),
mirroring the existing partial-match branch which already routes to a
bogus comment in HTML mode. xmlMode, foreign-content, and
`recognizeCDATA` behavior is unchanged.

Before: `<![CDATA[x><img>` -> comment `[CDATA[x><img>` (img hidden).
After:  `<![CDATA[x><img>` -> comment `[CDATA[x` + live `<img>`,
matching parse5 and browsers.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a0add854-b435-4e92-a30b-e8e7e247e965

📥 Commits

Reviewing files that changed from the base of the PR and between 4214370 and 50d9a65.

📒 Files selected for processing (2)
  • src/Tokenizer.ts
  • src/index.spec.ts

📝 Walkthrough

Walkthrough

Adds a recognizeCDATA option to the Tokenizer constructor. A new shouldRecognizeCDATA() helper determines CDATA recognition based on XML mode, the new option, or foreign context. The CDATA sequence handler now routes to either real CDATA or bogus comment state accordingly. Tests cover all three paths.

Changes

CDATA Recognition

Layer / File(s) Summary
recognizeCDATA option, helper, and state branching
src/Tokenizer.ts
Adds recognizeCDATA?: boolean constructor option (default false), introduces shouldRecognizeCDATA() checking XML mode, the option, or isInForeignContext(), and branches stateCDATASequence between CDATA and bogus-comment paths.
CDATA parsing tests
src/index.spec.ts
Adds three parseDocument tests: HTML-mode bogus comment terminating at >, foreign-content real CDATA, and xmlMode real CDATA.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit hopped through angle brackets one day,
Found CDATA lurking in a foreign display.
"Bogus comment or real?" it asked with a twitch—
A flag in the options now handles that switch.
🐇 recognizeCDATA: true — and away! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: HTML-mode handling of <![CDATA[ as a bogus comment.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant