Experiment: Port MySQL-on-SQLite to LALR(1) parser #432
Draft
JanJakes wants to merge 25 commits into
🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:
Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally. To reproduce locally: |
Add a new monorepo package for a MySQL parser generated from the official MySQL grammar. This commit sets up the package metadata; the source, tooling, and documentation follow in later commits.
Bring the MySQL lexer and the token and node classes over from the mysql-on-sqlite package unchanged, so the later adaptation to the official grammar is reviewable as a focused diff, and register src/ as the package Composer classmap (the WordPress-style file names rule out PSR-4).
Compile the grammar from the official MySQL sources: fetch sql_yacc.yy and lex.h at a pinned, checksum-verified mysql-server tag; run a pinned Bison build (Docker, version-asserted) to produce the automaton; compact the automaton into plain PHP ACTION/GOTO tables (about 7% of the dense cells); and derive the keyword table and token constants from lex.h, failing the build on any unresolved terminal. bin/build-grammar (composer run build-grammar) runs the pipeline end to end.
Commit the LALR(1) parse table produced by bin/build-grammar: a plain PHP array that compacts the grammar's dense ACTION/GOTO automaton to about 7% of its cells. Regenerate with composer run build-grammar. The token-level data (keyword table, paren-gated function keywords, and token constants) is generated into the lexer itself; see the next commit.
Make the lexer emit the grammar's own token numbers, with the keyword table generated from lex.h: keyword synonyms, paren-gated function keywords, and dropped keywords all follow MySQL's own data. Diagnostic token names are derived on demand instead of shipping a name map. The lexer produces MySQL's grammar token stream directly, the way MySQL's own lexer does, rather than scanning a different token model and reconciling it in a separate pass: "@" is a standalone terminal followed by its name, "WITH ROLLUP" is contracted via a one-token lookahead, NOT becomes NOT2 under HIGH_NOT_PRECEDENCE, and the input ends with END_OF_INPUT and Bison's end marker (omitted on invalid input). The pull iterator (next_token/get_token) and remaining_tokens() both yield this single stream; the scanner's internal sentinels stay private and never reach it.
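The "WITH ROLLUP" contraction described above can be sketched as follows. This is a Python illustration, not the package's PHP lexer; the token spellings `WITH`, `ROLLUP`, and `WITH_ROLLUP` are placeholders chosen for the example.

```python
def contract_with_rollup(tokens):
    """Merge a WITH token directly followed by ROLLUP into one token.

    `tokens` is a list of token names; the names are illustrative only.
    """
    out = []
    i = 0
    while i < len(tokens):
        # One-token lookahead: inspect the token after WITH without consuming it.
        if tokens[i] == "WITH" and i + 1 < len(tokens) and tokens[i + 1] == "ROLLUP":
            out.append("WITH_ROLLUP")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

A `WITH` that is not followed by `ROLLUP` (for example, one that opens a common table expression) passes through unchanged.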
Add a table-driven LALR(1) shift-reduce runtime (WP_Parser) over a WP_Parser_Grammar that expands a compact, generated ACTION/GOTO parse table, building a WP_Parser_Node AST. The grammar is unambiguous for LALR(1), so the loop is deterministic, with no conflict handling or backtracking. A rule that matches nothing produces no node, so empty optional rules are absent from the tree. This is grammar-agnostic: it knows nothing about MySQL, only how to run an LALR(1) parse table.

Adapt the copied parse-tree primitives to the package: the runtime builds each node in a single step, so the old recursive parser's merge_fragment() is dropped, and the node and token docblocks no longer reference that parser.
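For readers unfamiliar with table-driven parsing, here is a minimal Python sketch of a deterministic shift-reduce loop. It is not the WP_Parser implementation: the tables are hand-built for the toy grammar `list: list ',' item | item`, and the table layout (`ACTION`, `GOTO`, `RULES` as dictionaries) is an assumption made for the example.

```python
# rule number -> (left-hand side, number of right-hand-side symbols)
RULES = {1: ("list", 3), 2: ("list", 1)}

# (state, terminal) -> (action, argument); "$" marks end of input
ACTION = {
    (0, "item"): ("shift", 1),
    (1, ","): ("reduce", 2), (1, "$"): ("reduce", 2),
    (2, ","): ("shift", 3), (2, "$"): ("accept", 0),
    (3, "item"): ("shift", 4),
    (4, ","): ("reduce", 1), (4, "$"): ("reduce", 1),
}

# (state, non-terminal) -> next state
GOTO = {(0, "list"): 2}


def parse(tokens):
    """Return a nested (rule, children) tuple, or None on a syntax error."""
    states = [0]
    nodes = []
    stream = list(tokens) + ["$"]
    pos = 0
    while True:
        action = ACTION.get((states[-1], stream[pos]))
        if action is None:
            return None  # no table entry: syntax error
        kind, arg = action
        if kind == "shift":
            states.append(arg)
            nodes.append(stream[pos])
            pos += 1
        elif kind == "reduce":
            lhs, n = RULES[arg]
            cut = len(nodes) - n
            children = nodes[cut:]
            del nodes[cut:]
            del states[len(states) - n:]
            nodes.append((lhs, children))  # node built in a single step
            states.append(GOTO[(states[-1], lhs)])
        else:  # accept
            return nodes[0]
```

Each iteration performs exactly one table lookup, so there is nothing to backtrack over; the left-recursive rule yields a tree that nests through its first child.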
Wire the generated MySQL parse table into the generic LALR(1) runtime through a factory: WP_MySQL_Parser_Factory::create_parser() builds a WP_Parser over a WP_Parser_Grammar loaded from src/mysql-parse-table.php. The grammar is expanded once and shared between created parsers; create_grammar() exposes a fresh grammar for callers that want their own. This is the only piece that knows the parser is being used for MySQL.
Cut the generated parse table from 190 KB to 177 KB (-7%) with no behavior change: most shifts on a given terminal go to the same successor state, so those cells are stored as bare token lists (action_row_shift_tokens) and restored from a per-terminal target table (action_shift_targets) when the grammar is constructed. The smaller file also parses faster on a cold opcache.
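The compaction idea can be sketched in Python as below. The function and variable names are hypothetical and the data layout is simplified (shift actions only, as nested dictionaries); the generated PHP table uses its own format with action_row_shift_tokens and action_shift_targets.

```python
from collections import Counter


def compact(rows):
    """rows: {state: {terminal: target_state}}, shift actions only."""
    # Find the most common shift target for each terminal.
    counts = {}
    for row in rows.values():
        for term, target in row.items():
            counts.setdefault(term, Counter())[target] += 1
    shift_targets = {t: c.most_common(1)[0][0] for t, c in counts.items()}

    bare, explicit = {}, {}
    for state, row in rows.items():
        # Cells matching the per-terminal default are stored as bare terminals.
        bare[state] = [t for t, tgt in row.items() if shift_targets[t] == tgt]
        # Everything else keeps its explicit target.
        explicit[state] = {t: tgt for t, tgt in row.items() if shift_targets[t] != tgt}
    return shift_targets, bare, explicit


def expand(shift_targets, bare, explicit):
    """Rebuild the original rows from the compact form."""
    return {
        state: {**{t: shift_targets[t] for t in bare[state]}, **explicit[state]}
        for state in bare
    }
```

The round trip `expand(*compact(rows)) == rows` holds for any input, which is why the change can be made with no behavior change.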
Bring the query corpus extracted from the MySQL server test suite, with the tooling that generates it, into the package: data/mysql-server-query-corpus/ plus a bin/build-corpus orchestrator (composer run build-corpus) that fetches the mysql-test directory at the pinned tag and extracts the queries. The SQLite driver package keeps its own copy for now; it will be retired when the driver is ported to this package.
Measure the corpus parse rate and end-to-end (lex + parse) throughput, with warmup and timed passes. The parser accepts 99.76% of the ~69k corpus queries.
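A measurement loop with warmup and timed passes might look like the Python sketch below. The helper name `best_of` and its defaults are assumptions for illustration; the package's benchmark tool is written in PHP and may differ.

```python
import time


def best_of(run, warmups=2, passes=5):
    """Call run() `warmups` times untimed, then return the fastest of `passes` timed calls."""
    for _ in range(warmups):
        run()  # warm caches (and a JIT, where one exists) before timing
    best = float("inf")
    for _ in range(passes):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the best pass rather than the mean reduces the influence of noisy neighbours on shared hardware; throughput is then the query count divided by the returned time.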
Cover the token stream, the scanner (the exhaustive unit suite ported from the SQLite driver), the parser runtime, token value and name resolution, generated grammar-data invariants, and a corpus regression test pinning the exact acceptance tally. Run the suite on the oldest and newest supported PHP versions in CI.
Tokenizing a whole statement routed every token through the pull iterator (next_token -> produce -> scan_lexeme -> read_next_token -> enqueue_token), adding ~4 method calls plus token-queue bookkeeping per token over a plain scan-and-emit loop. Give remaining_tokens() a tight fast path that emits the common single-token lexemes inline and delegates only the rare multi-token ones (@, WITH ROLLUP, end markers) to the buffered producers. The pull API is unchanged and the output is byte-identical; ~24% faster (no JIT) / ~16% (JIT) end-to-end over the MySQL server corpus.
The pull iterator buffered produced tokens in a dynamic $token_queue drained by index. A scan step yields at most two grammar tokens, so a single $pending_token slot suffices: next_token() returns the first and holds the second. The multi-token producers (@, WITH ROLLUP, end markers) now append to a caller-supplied array, shared directly by both next_token() and remaining_tokens() — removing the queue bookkeeping and the duplicated drain in the fast path. A make_token() helper unifies token construction. Output is byte-identical and throughput is unchanged (the multi-token cases were already off the hot path); this is a structural cleanup.
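The single-slot design can be illustrated with a small Python sketch. The class name and the way scan steps are fed in (a list of one- or two-token lists) are assumptions for the example; only the `next_token` name and the pending-slot idea come from the description above.

```python
class PullLexer:
    """Pull iterator that holds at most one pending token between calls."""

    def __init__(self, steps):
        # steps: iterable of lists; each list is what one scan step produced
        # (one token, or two for the multi-token cases).
        self._steps = iter(steps)
        self._pending = None

    def next_token(self):
        if self._pending is not None:
            # Second token of the previous scan step.
            token, self._pending = self._pending, None
            return token
        produced = next(self._steps, None)
        if produced is None:
            return None  # input exhausted
        if len(produced) == 2:
            self._pending = produced[1]
        return produced[0]
```

Because a scan step never yields more than two tokens, one slot is enough and no queue index needs to be maintained.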
The final backslash-stripping step used preg_replace() with the "u" (UTF-8) modifier. That modifier makes PCRE validate the whole subject as UTF-8 and return null on the first invalid byte; since get_value() is typed ": string", the null turned into a fatal TypeError. MySQL string literals may legitimately carry non-UTF-8 bytes (binary or other-charset payloads), and the lexer scans them at the byte level, so reading the value of such a literal crashed. Switch the modifier to "s" (DOTALL). A byte-wise strip is binary-safe, yields identical results for valid UTF-8 (no continuation byte is a backslash), and additionally handles a backslash preceding a newline byte.
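A byte-wise strip of this kind can be sketched in Python, where a pattern compiled from `bytes` never validates the subject as UTF-8. The function name is hypothetical, and the sketch assumes that recognised escapes such as `\n` were already handled by an earlier step, as the description implies.

```python
import re

# A backslash followed by any single byte; DOTALL lets "." match a newline byte too.
_BACKSLASH_PAIR = re.compile(rb"\\(.)", re.DOTALL)


def strip_backslashes(raw: bytes) -> bytes:
    """Drop each escaping backslash and keep the byte that follows it."""
    return _BACKSLASH_PAIR.sub(rb"\1", raw)
```

The invalid UTF-8 byte `0xFF` passes through untouched, and a backslash before a newline is handled because of the DOTALL flag.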
WP_MySQL_Token::get_value() routed backtick-quoted identifiers through the same backslash-unescaping path as string literals. In MySQL a backslash is never an escape inside `...` identifiers; only a doubled backtick is. As a result an identifier such as `a\nb` came back as "a<newline>b", silently corrupting any table, column, or alias name that contains a backslash. Treat backtick identifiers like the NO_BACKSLASH_ESCAPES path: collapse only the doubled bounding backtick and keep every other byte literal.
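The intended rule for backtick identifiers fits in a few lines. This Python sketch uses a hypothetical function name and is not the get_value() implementation.

```python
def unquote_backtick_identifier(raw: str) -> str:
    """Strip the bounding backticks and collapse doubled backticks only."""
    if len(raw) < 2 or raw[0] != "`" or raw[-1] != "`":
        raise ValueError("not a backtick-quoted identifier")
    # Backslashes are ordinary characters here; only `` is an escape.
    return raw[1:-1].replace("``", "`")
```

With this rule, the identifier written as a, backslash, n, b between backticks keeps its backslash instead of turning into a newline.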
When resolving a function keyword (SYM_FN), the lexer peeks for a following "("
and, under SQL_MODE_IGNORE_SPACE, skips intervening whitespace first. It skipped
by advancing bytes_already_read and never restored it. When no "(" followed, the
keyword was emitted as an IDENTIFIER whose length — derived from
bytes_already_read in produce() — now covered the trailing whitespace, so the
extracted value was e.g. "COUNT " instead of "COUNT". Under this ANSI-style mode
a column or table named after a function would resolve to the wrong identifier.
Peek with a local index instead of mutating bytes_already_read, so the token's
byte range ends at the keyword and the next scan consumes the whitespace.
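The fix amounts to peeking with a local index and leaving the scanner position alone. A Python sketch, with hypothetical function and token-type names:

```python
def classify_function_keyword(sql: str, keyword_end: int, ignore_space: bool):
    """Decide how to emit a function keyword that ends at index `keyword_end`.

    Returns (token_type, token_end). token_end is always keyword_end, so the
    token never covers whitespace that was skipped while peeking.
    """
    peek = keyword_end  # local index; the scanner position is not mutated
    if ignore_space:
        while peek < len(sql) and sql[peek].isspace():
            peek += 1
    if peek < len(sql) and sql[peek] == "(":
        return "FUNCTION_KEYWORD", keyword_end
    return "IDENTIFIER", keyword_end
```

For the input `COUNT FROM t` the token ends right after `COUNT`, so the extracted value has no trailing space.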
read_mysql_comment() read at most five version digits, so a six-digit MMmmrr version comment — added in MySQL 8.4 — was misparsed: /*!100000 ... */ gated as version 10000 instead of 100000, and the sixth digit of /*!080400 ... */ leaked into the comment body as SQL. Mirror MySQL's own lexer rule (sql/sql_lex.cc): the first five characters must be digits; a sixth digit immediately followed by whitespace extends the version to six digits; otherwise the version stays five digits and any extra is content.
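The digit rule as stated above can be written out as a Python sketch. The function name and return shape are assumptions; it models only the rule in this description, not the full comment scanner.

```python
DIGITS = "0123456789"


def read_comment_version(body: str):
    """Parse the version at the start of `body`, the text right after '/*!'.

    Returns (version, chars_consumed), or (None, 0) if there is no version.
    """
    if len(body) < 5 or any(c not in DIGITS for c in body[:5]):
        return None, 0
    # A sixth digit counts only when whitespace follows it immediately.
    if len(body) >= 7 and body[5] in DIGITS and body[6].isspace():
        return int(body[:6]), 6
    return int(body[:5]), 5
```

For example, `100000 SELECT 1 */` yields version 100000, while `123456x` stays at the five-digit version 12345 and leaves `6x` as comment content.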
Left-recursive grammar list rules nest through their own rule name
("list: list ',' item | item"). The new accessor collects child nodes
of the whole nested chain in source order, as if the list were flat,
which is how AST consumers want to iterate list items.
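A flattening accessor of this kind can be sketched in Python. The `Node` class and function name are hypothetical stand-ins for WP_Parser_Node, and the sketch assumes that only child nodes are returned, not tokens such as the comma.

```python
class Node:
    def __init__(self, rule, children):
        self.rule = rule          # grammar rule name
        self.children = children  # Node instances and plain token strings


def flattened_child_nodes(node):
    """Collect child nodes of a left-recursive chain in source order."""
    result = []
    for child in node.children:
        if not isinstance(child, Node):
            continue  # skip tokens (assumption: only nodes are returned)
        if child.rule == node.rule:
            # Nested "list" under "list": splice its items in place.
            result.extend(flattened_child_nodes(child))
        else:
            result.append(child)
    return result
```

Because the recursion follows the nested list before appending the current item, the items come out in the order they appear in the source.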
With the ANSI_QUOTES SQL mode, MySQL treats double-quoted text as a quoted identifier instead of a string literal. Emit an identifier token for it, so identifier positions accept double-quoted names.
Replace the hand-written recursive parser with the table-driven LALR(1) parser generated from MySQL's official grammar, consumed as a Composer dependency:

- Require wordpress/mysql-parser, resolved from the monorepo sibling package via a Composer path repository, and load it through the Composer autoloader in the driver loader.
- Drop the old parser machinery (WP_Parser, WP_Parser_Grammar, the lexer, the parse tree classes, and mysql-grammar.php), all provided by the parser package now, and the native parser fork, which is bound to the old grammar contract.
- Parse multi-statement input by splitting the token stream on top-level ';' separators, as the grammar parses a single statement (this is how MySQL clients split multi-statement input).
- Re-key the statement dispatch to the sql_yacc.yy rule names and map keyword token constants to the grammar keyword table.

The translation layer still needs to be ported to the new AST shapes.
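Splitting on the token stream rather than on raw text means a ';' inside a string literal or comment is never a separator, because it never appears as its own token. A Python sketch with a hypothetical function name, assuming empty statements are dropped:

```python
def split_statements(tokens):
    """Split a flat token list into one token list per statement."""
    statements, current = [], []
    for token in tokens:
        if token == ";":
            if current:  # assumption: ignore empty statements such as ";;"
                statements.append(current)
            current = []
        else:
            current.append(token)
    if current:
        statements.append(current)  # final statement without a trailing ';'
    return statements
```

Each resulting list can then be handed to the single-statement parser on its own.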
Re-key the SQL-to-SQLite translation from the old hand-written grammar to the sql_yacc.yy rule names and tree shapes:

- Rewrite the translate() special cases and per-statement handlers (SELECT, INSERT/REPLACE, UPDATE, DELETE, DDL, SHOW, SET, USE, transactions and locking, administration statements).
- Iterate grammar lists with the flattened child node accessor, as lists are left-recursive in the new grammar.
- Walk JOINs recursively when building the table reference map, as joins nest through the left operand in the new grammar.
- Retry parsing with the ANSI_QUOTES SQL mode when a query fails to parse. MySQL rejects double-quoted identifiers without ANSI_QUOTES, but WordPress relies on them (dbDelta can produce double-quoted index names) and the previous parser accepted them.
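The ANSI_QUOTES retry is a plain two-attempt fallback. In this Python sketch the `parse` callable and its `ansi_quotes` keyword are hypothetical; the driver's actual API is not shown on this page.

```python
def parse_with_ansi_quotes_fallback(parse, sql):
    """Try the default SQL mode first, then retry once with ANSI_QUOTES.

    `parse(sql, ansi_quotes=...)` is a stand-in that returns a tree or None.
    """
    tree = parse(sql, ansi_quotes=False)
    if tree is None:
        # Double-quoted text is now lexed as an identifier, not a string.
        tree = parse(sql, ansi_quotes=True)
    return tree
```

Queries that parse in the default mode never pay for the second attempt, and a query that fails in both modes still returns None.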
Re-key CREATE TABLE, ALTER TABLE, and index statement analysis to the sql_yacc.yy rule names and tree shapes. The recorded information schema rows are unchanged: a battery of DDL statements covering all supported data types, constraints, indexes, and table options produces the exact same rows as the previous parser and builder. Multi-column ADD COLUMN (a INT, b INT) is now recorded correctly; the previous builder crashed on it.
The lexer, parser, token data, and parse tree classes are tested in the wordpress/mysql-parser package now:

- Remove the lexer and parser test suites from the driver package (the corpus data stays here; the parser package corpus test reads it from the sibling package and skips when it is not available).
- Move the parse tree node tests to the parser package and cover the new flattened child node accessor.
- Remove the native parser extension tests and tools, which are bound to the old grammar contract.
- Update the AST dump and benchmark tools to the new parser API.
The SQLite driver now loads the MySQL parser as a Composer dependency, and the native parser extension bound to the old grammar is gone:

- Install the driver Composer dependencies in the WordPress test setup and mount the package vendor directory and the parser package into the WordPress containers.
- Bundle the driver's production Composer dependencies into the plugin zip, resolving the path-repository symlink into a real copy of the parser package.
- Run the driver test workflow against changes to the parser package and drop the native parser extension jobs and setup scripts.
- Install the driver Composer dependencies in the lexer benchmark workflow.
adamziel pushed a commit that referenced this pull request on Jun 23, 2026
> [!NOTE]
> The changed line counts are misleading: about 115,000 of the added lines are just a testing query corpus.
> (Copied to the new `mysql-parser` package from `mysql-on-sqlite`.)

## LALR(1) parser from official MySQL grammar

A new experimental **`packages/mysql-parser` package** that implements a universal **LALR(1) parser** and builds a MySQL parse table from the **official MySQL grammar**.

This is the initial implementation, not used anywhere in the driver yet. A full driver migration to this new parser is AI-prototyped in #432.

### What it does

- **Grammar processing pipeline:** Fetch sources → Bison → generate parse table and token data.
- **Lexer:** The existing MySQL lexer was copied and adapted to the new LALR(1) grammar.
- **Parser:** A new universal LALR(1) parser implementation.
- **MySQL grammar:** A compacted MySQL 8.4 LTS grammar, extracted using the grammar processing pipeline.
- **MySQL query corpus:** The ~70k MySQL query corpus was copied and updated to MySQL 8.4 LTS.
- **Benchmark:** A no-JIT/JIT lexer + parser benchmark.
- **Test suite:** New tests and a CI job.

### What it doesn't do yet

- **Replace the current parser:** It's a standalone package that doesn't replace the existing parser yet.
- **Multi-version:** For now, the parser only tracks MySQL 8.4 LTS. Multi-version will be done as a follow-up.

### Benchmarks

Measured on a MacBook Pro M4 Max with PHP 8.4, using the package's 8.4.10 corpus (~70k queries), end-to-end (lex + parse), best of 5 timed passes after 2 warmups:

| Metric | LL (trunk) | LALR (this) |
| --- | --- | --- |
| Throughput, no JIT | 11,010 QPS | **59,457 QPS** |
| Throughput, warm JIT | 24,393 QPS | **112,759 QPS** |
| Cold boot, no opcache | **~1.9 ms** | ~2.7 ms |
| Warm boot, opcache | ~0.6 ms | **~0.3 ms** |
| Memory, no opcache | **~3.4 MB** | ~5.4 MB |
| Memory, opcache worker | **~1.8 MB** | ~3.1 MB |
| Generated parser/table file size | **65 KB** | 177 KB |
| Full size (lexer + parser + grammar) | **246 KB** | 260 KB |

This parser is over **5× faster** without JIT and over **4.5× faster** with JIT. Cold boot is a bit slower; warm boot is faster. The memory footprint is a bit higher, and the overall size is about 14 KB larger.

#### Recognize-only

The same lex + parse runs, but building **no AST**, so only recognition is measured, without AST allocation:

| Throughput | LL (trunk) | LALR (this) |
| --- | --- | --- |
| no JIT | 16,359 QPS | **95,374 QPS** |
| warm JIT | 49,940 QPS | **210,032 QPS** |

Dropping AST construction lifts both by ~1.5–2×, but the gap stays around **~4.2–5.8×**.
Stacked on #429. An end-to-end experiment that ports the MySQL-on-SQLite driver from its hand-written recursive parser to the new LALR(1) parser, consumed as a real Composer dependency.
Not meant to merge as-is: it depends on #429 landing first. All tests pass (543 in mysql-on-sqlite, 131 in mysql-parser, including the MySQL server corpus pin), and the diff is net-negative: it deletes the driver's old parser machinery.

What it does

- `wordpress/mysql-parser` gets a classmap autoloader (WordPress file naming rules out PSR-4) and exposes `WP_MySQL_Parser::PARSE_TABLE_PATH` for the generated parse table. The driver requires it through a Composer path repository (a vendor symlink), so there's one source of truth and nothing is duplicated. The driver's old parser machinery (grammar, lexer, parse-tree classes, and the native Rust parser fork) is removed entirely.
- […] `sql_yacc.yy` rule names and tree shapes. Multi-statement input is split on top-level `;` (the grammar parses one statement, the way MySQL clients do); `create_parser()`/`next_query()` becomes `parse_mysql_query()`. The info-schema builder is verified byte-exact against the old builder across a DDL battery (data types, constraints, indexes, table options); multi-column `ADD COLUMN (a INT, b INT)`, which crashed the old builder, is now recorded correctly.
- […] (`opt_*`, Bison mid-rule `$@N`) produce no AST nodes, so consumers see an optional clause only when it's present. `WP_Parser_Node::get_flattened_child_nodes()` iterates left-recursive grammar lists (`list: list ',' item`) as if flat.
- ([…] `packages/php-ext-wp-mysql-parser` is orphaned by this branch).

What it doesn't do yet

Testing