Experiment: Port MySQL-on-SQLite to LALR(1) parser by JanJakes · Pull Request #432 · WordPress/sqlite-database-integration

JanJakes · 2026-06-12T07:32:48Z

Stacked on #429. An end-to-end experiment that ports the MySQL-on-SQLite driver from its hand-written recursive parser to the new LALR(1) parser, consumed as a real Composer dependency.

Not meant to merge as-is — it depends on #429 landing first. All tests pass (543 in mysql-on-sqlite, 131 in mysql-parser, including the MySQL server corpus pin), and the diff is net-negative: it deletes the driver's old parser machinery.

What it does

- Package reuse via Composer. wordpress/mysql-parser gets a classmap autoloader (WordPress file naming rules out PSR-4) and exposes WP_MySQL_Parser::PARSE_TABLE_PATH for the generated parse table. The driver requires it through a Composer path repository (a vendor symlink), so there's one source of truth and nothing is duplicated. The driver's old parser machinery (grammar, lexer, parse-tree classes, and the native Rust parser fork) is removed entirely.
- Driver port. Statement dispatch, the query translation layer, and the information-schema builder are re-keyed to the official sql_yacc.yy rule names and tree shapes. Multi-statement input is split on top-level ; (the grammar parses one statement, the way MySQL clients do); create_parser()/next_query() becomes parse_mysql_query(). The info-schema builder is verified byte-exact against the old builder across a DDL battery (data types, constraints, indexes, table options); multi-column ADD COLUMN (a INT, b INT), which crashed the old builder, is now recorded correctly.
- Parser refinements the port surfaced.
  - Empty reductions (opt_*, Bison mid-rule $@N) produce no AST nodes, so consumers see an optional clause only when it's present.
  - WP_Parser_Node::get_flattened_child_nodes() iterates left-recursive grammar lists (list: list ',' item) as if flat.
  - ANSI_QUOTES lexer mode plus a driver-side parse retry, for double-quoted identifiers that WordPress emits (e.g. dbDelta) but MySQL rejects without the mode.
- Deployment & CI. Docker environments install the driver's Composer deps and mount the package; the plugin-zip build resolves the path symlink into a pruned copy; the driver workflow also triggers on parser-package changes; the native parser extension jobs are removed (packages/php-ext-wp-mysql-parser is orphaned by this branch).

What it doesn't do yet

- Merge: it's a prototype stacked on #429 (LALR(1) parser from official MySQL grammar), which needs to land first.
- Multi-version: tracks MySQL 8.4 LTS only (inherited from #429).

Testing

cd packages/mysql-on-sqlite && composer install && composer run test
cd packages/mysql-parser && composer install && composer run test
composer run build-sqlite-plugin-zip

github-actions · 2026-06-12T07:33:49Z

🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

| Config | Base (QPS) | This PR (QPS) | Speedup |
| --- | --- | --- | --- |
| no JIT | 70,879 | 72,307 | 1.02× |
| tracing JIT | 160,210 | 189,967 | 1.19× |

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

Add a new monorepo package for a MySQL parser generated from the official MySQL grammar. This commit sets up the package metadata; the source, tooling, and documentation follow in later commits.

Bring the MySQL lexer and the token and node classes over from the mysql-on-sqlite package unchanged, so the later adaptation to the official grammar is reviewable as a focused diff, and register src/ as the package Composer classmap (the WordPress-style file names rule out PSR-4).

Compile the grammar from the official MySQL sources: fetch sql_yacc.yy and lex.h at a pinned, checksum-verified mysql-server tag; run a pinned Bison build (Docker, version-asserted) to produce the automaton; compact the automaton into plain PHP ACTION/GOTO tables (about 7% of the dense cells); and derive the keyword table and token constants from lex.h, failing the build on any unresolved terminal. bin/build-grammar (composer run build-grammar) runs the pipeline end to end.

Commit the LALR(1) parse table produced by bin/build-grammar: a plain PHP array that compacts the grammar's dense ACTION/GOTO automaton to about 7% of its cells. Regenerate with composer run build-grammar. The token-level data (keyword table, paren-gated function keywords, and token constants) is generated into the lexer itself; see the next commit.

Make the lexer emit the grammar's own token numbers, with the keyword table generated from lex.h: keyword synonyms, paren-gated function keywords, and dropped keywords all follow MySQL's own data. Diagnostic token names are derived on demand instead of shipping a name map. The lexer produces MySQL's grammar token stream directly, the way MySQL's own lexer does, rather than scanning a different token model and reconciling it in a separate pass: "@" is a standalone terminal followed by its name, "WITH ROLLUP" is contracted via a one-token lookahead, NOT becomes NOT2 under HIGH_NOT_PRECEDENCE, and the input ends with END_OF_INPUT and Bison's end marker (omitted on invalid input). The pull iterator (next_token/get_token) and remaining_tokens() both yield this single stream; the scanner's internal sentinels stay private and never reach it.

A table-driven LALR(1) shift-reduce runtime (WP_Parser) over a WP_Parser_Grammar that expands a compact, generated ACTION/GOTO parse table, building a WP_Parser_Node AST. The grammar is unambiguous for LALR(1), so the loop is deterministic, with no conflict handling or backtracking. A rule that matches nothing produces no node, so empty optional rules are absent from the tree. This is grammar-agnostic: it knows nothing about MySQL, only how to run an LALR(1) parse table. Adapt the copied parse-tree primitives to the package: the runtime builds each node in a single step, so the old recursive parser's merge_fragment() is dropped, and the node and token docblocks no longer reference that parser.
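As a rough illustration of the shift-reduce loop described above, here is a minimal sketch in Python rather than PHP. The toy grammar, the hand-built `ACTION`/`GOTO` tables, and the tuple-based tree are all assumptions made for the example; they are not the package's generated table or its `WP_Parser_Node` class. The sketch shows one way an empty optional rule can end up absent from the tree.

```python
# Hand-built LALR(1) table for a toy grammar (illustrative, not generated):
#   1: stmt         -> SELECT opt_distinct ID
#   2: opt_distinct -> (empty)
#   3: opt_distinct -> DISTINCT
ACTION = {
    0: {"SELECT": ("shift", 1)},
    1: {"DISTINCT": ("shift", 3), "ID": ("reduce", 2)},
    2: {"$": ("accept", None)},
    3: {"ID": ("reduce", 3)},
    4: {"ID": ("shift", 5)},
    5: {"$": ("reduce", 1)},
}
GOTO = {0: {"stmt": 2}, 1: {"opt_distinct": 4}}
RULES = {1: ("stmt", 3), 2: ("opt_distinct", 0), 3: ("opt_distinct", 1)}


def parse(tokens):
    """Return a (rule_name, children) tree, or None on a syntax error."""
    stream = list(tokens) + ["$"]  # "$" stands in for the end marker
    states = [0]                   # state stack
    values = []                    # one value per shifted token or reduced rule
    pos = 0
    while True:
        action = ACTION[states[-1]].get(stream[pos])
        if action is None:
            return None            # no table entry: reject the input
        kind, arg = action
        if kind == "accept":
            return values[-1]
        if kind == "shift":
            states.append(arg)
            values.append(stream[pos])
            pos += 1
            continue
        # Reduce: pop the rule's right-hand side, then follow GOTO.
        lhs, length = RULES[arg]
        children = []
        if length:
            children = [v for v in values[-length:] if v is not None]
            del values[-length:]
            del states[-length:]
        # A rule that matched nothing yields None instead of a node.
        values.append((lhs, children) if children else None)
        states.append(GOTO[states[-1]][lhs])
```

With these tables, `parse(["SELECT", "ID"])` yields a `stmt` node with two token children and no `opt_distinct` child, while `parse(["SELECT", "DISTINCT", "ID"])` includes the optional node.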

Wire the generated MySQL parse table into the generic LALR(1) runtime through a factory: WP_MySQL_Parser_Factory::create_parser() builds a WP_Parser over a WP_Parser_Grammar loaded from src/mysql-parse-table.php. The grammar is expanded once and shared between created parsers; create_grammar() exposes a fresh grammar for callers that want their own. This is the only piece that knows the parser is being used for MySQL.

Cut the generated parse table from 190 KB to 177 KB (-7%) with no behavior change: most shifts on a given terminal go to the same successor state, so those cells are stored as bare token lists (action_row_shift_tokens) and restored from a per-terminal target table (action_shift_targets) when the grammar is constructed. The smaller file also parses faster on a cold opcache.
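The compaction idea can be sketched roughly as follows, in Python for brevity. The key names `action_row_shift_tokens` and `action_shift_targets` come from the commit message; the in-memory layout, the `compact`/`expand` function names, and the choice of the most frequent target per terminal are assumptions of this sketch, not the package's actual format.

```python
from collections import Counter


def compact(action_rows):
    """action_rows: {state: {terminal: ("shift", target) | ("reduce", rule)}}."""
    # Pick the most common shift target for each terminal.
    counts = {}
    for row in action_rows.values():
        for terminal, (kind, arg) in row.items():
            if kind == "shift":
                counts.setdefault(terminal, Counter())[arg] += 1
    action_shift_targets = {t: c.most_common(1)[0][0] for t, c in counts.items()}

    rows = {}                     # cells that still need an explicit entry
    action_row_shift_tokens = {}  # per state: terminals using the common target
    for state, row in action_rows.items():
        for terminal, action in row.items():
            if action == ("shift", action_shift_targets.get(terminal)):
                action_row_shift_tokens.setdefault(state, []).append(terminal)
            else:
                rows.setdefault(state, {})[terminal] = action
    return rows, action_row_shift_tokens, action_shift_targets


def expand(rows, action_row_shift_tokens, action_shift_targets):
    """Rebuild the full ACTION rows from the compact form."""
    full = {state: dict(row) for state, row in rows.items()}
    for state, terminals in action_row_shift_tokens.items():
        for terminal in terminals:
            full.setdefault(state, {})[terminal] = ("shift", action_shift_targets[terminal])
    return full
```

Expanding the compact form gives back the original rows, which is the "no behavior change" property the commit relies on.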

Bring the query corpus extracted from the MySQL server test suite, with the tooling that generates it, into the package: data/mysql-server-query-corpus/ plus a bin/build-corpus orchestrator (composer run build-corpus) that fetches the mysql-test directory at the pinned tag and extracts the queries. The SQLite driver package keeps its own copy for now; it will be retired when the driver is ported to this package.

Measure the corpus parse rate and end-to-end (lex + parse) throughput, with warmup and timed passes. The parser accepts 99.76% of the ~69k corpus queries.

Cover the token stream, the scanner (the exhaustive unit suite ported from the SQLite driver), the parser runtime, token value and name resolution, generated grammar-data invariants, and a corpus regression test pinning the exact acceptance tally. Run the suite on the oldest and newest supported PHP versions in CI.

Tokenizing a whole statement routed every token through the pull iterator (next_token -> produce -> scan_lexeme -> read_next_token -> enqueue_token), adding ~4 method calls plus token-queue bookkeeping per token over a plain scan-and-emit loop. Give remaining_tokens() a tight fast path that emits the common single-token lexemes inline and delegates only the rare multi-token ones (@, WITH ROLLUP, end markers) to the buffered producers. The pull API is unchanged and the output is byte-identical; ~24% faster (no JIT) / ~16% (JIT) end-to-end over the MySQL server corpus.

The pull iterator buffered produced tokens in a dynamic $token_queue drained by index. A scan step yields at most two grammar tokens, so a single $pending_token slot suffices: next_token() returns the first and holds the second. The multi-token producers (@, WITH ROLLUP, end markers) now append to a caller-supplied array, shared directly by both next_token() and remaining_tokens() — removing the queue bookkeeping and the duplicated drain in the fast path. A make_token() helper unifies token construction. Output is byte-identical and throughput is unchanged (the multi-token cases were already off the hot path); this is a structural cleanup.
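A minimal Python sketch of the single-slot idea, under the stated premise that one scan step yields at most two tokens. The `PullLexer` class, its `_scan()` helper, and the string tokens are hypothetical stand-ins; only `next_token()` mirrors a name from the commit message.

```python
class PullLexer:
    """Pull iterator that holds at most one buffered token."""

    def __init__(self, lexemes):
        self._lexemes = iter(lexemes)
        self._pending = None  # the single slot replacing a token queue

    def _scan(self):
        # Hypothetical scan step: "@name" becomes two tokens, anything else one.
        lexeme = next(self._lexemes, None)
        if lexeme is None:
            return []
        if lexeme.startswith("@") and len(lexeme) > 1:
            return ["@", lexeme[1:]]
        return [lexeme]

    def next_token(self):
        # Hand out the held token first, if any.
        if self._pending is not None:
            token, self._pending = self._pending, None
            return token
        produced = self._scan()
        if not produced:
            return None  # end of input
        if len(produced) == 2:
            self._pending = produced[1]
        return produced[0]
```

Pulling from `PullLexer(["SELECT", "@x"])` returns `SELECT`, then `@`, then `x`, then `None`.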

The final backslash-stripping step used preg_replace() with the "u" (UTF-8) modifier. That modifier makes PCRE validate the whole subject as UTF-8 and return null on the first invalid byte; since get_value() is typed ": string", the null turned into a fatal TypeError. MySQL string literals may legitimately carry non-UTF-8 bytes (binary or other-charset payloads), and the lexer scans them at the byte level, so reading the value of such a literal crashed. Switch the modifier to "s" (DOTALL). A byte-wise strip is binary-safe, yields identical results for valid UTF-8 (no continuation byte is a backslash), and additionally handles a backslash preceding a newline byte.
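The same byte-wise approach can be illustrated in Python, where a regular expression compiled from a bytes pattern never validates UTF-8. This is an analogy to the PHP fix, not the lexer's code; the `strip_backslashes` name and the exact pattern are assumptions of the sketch.

```python
import re


def strip_backslashes(raw: bytes) -> bytes:
    """Drop each backslash and keep the byte that follows it."""
    # DOTALL lets "." match a newline byte as well.
    return re.sub(rb"\\(.)", rb"\1", raw, flags=re.DOTALL)


# A non-UTF-8 byte (0xFF) passes through untouched.
print(strip_backslashes(b"\xff\\'x"))
# A backslash directly before a newline is stripped too.
print(strip_backslashes(b"a\\\nb"))
```

Without the DOTALL flag, the second input would be returned unchanged, because "." would not match the newline byte.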

WP_MySQL_Token::get_value() routed backtick-quoted identifiers through the same backslash-unescaping path as string literals. In MySQL a backslash is never an escape inside `...` identifiers; only a doubled backtick is. As a result an identifier such as `a\nb` came back as "a<newline>b", silently corrupting any table, column, or alias name that contains a backslash. Treat backtick identifiers like the NO_BACKSLASH_ESCAPES path: collapse only the doubled bounding backtick and keep every other byte literal.
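The intended rule is small enough to sketch directly. The Python function below is a hypothetical stand-in for the backtick branch of get_value(), assuming the raw token text still includes its bounding backticks.

```python
def unquote_backtick_identifier(raw: str) -> str:
    """Return the identifier name for a backtick-quoted token."""
    if len(raw) < 2 or raw[0] != "`" or raw[-1] != "`":
        raise ValueError("not a backtick-quoted identifier")
    # Only a doubled backtick is an escape; backslashes stay literal.
    return raw[1:-1].replace("``", "`")
```

For example, the token `` `a\nb` `` keeps its backslash and the letter n as two separate characters instead of turning into a newline.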

When resolving a function keyword (SYM_FN), the lexer peeks for a following "(" and, under SQL_MODE_IGNORE_SPACE, skips intervening whitespace first. It skipped by advancing bytes_already_read and never restored it. When no "(" followed, the keyword was emitted as an IDENTIFIER whose length — derived from bytes_already_read in produce() — now covered the trailing whitespace, so the extracted value was e.g. "COUNT " instead of "COUNT". Under this ANSI-style mode a column or table named after a function would resolve to the wrong identifier. Peek with a local index instead of mutating bytes_already_read, so the token's byte range ends at the keyword and the next scan consumes the whitespace.
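A small Python sketch of peeking with a local index. The function names and the token kind labels are hypothetical; the point is only that the lookahead never moves the position used to compute the token's byte range.

```python
def followed_by_open_paren(sql: str, end: int, ignore_space: bool) -> bool:
    """Check for "(" after position `end` without consuming anything."""
    i = end  # local index; the caller's read position is untouched
    if ignore_space:
        while i < len(sql) and sql[i].isspace():
            i += 1
    return i < len(sql) and sql[i] == "("


def scan_function_keyword(sql: str, start: int, end: int, ignore_space: bool):
    """Classify the keyword at sql[start:end] and return (kind, value)."""
    if followed_by_open_paren(sql, end, ignore_space):
        kind = "FUNCTION_KEYWORD"
    else:
        kind = "IDENTIFIER"
    # The value always ends at the keyword, never at skipped whitespace.
    return kind, sql[start:end]
```

With `ignore_space=True`, `scan_function_keyword("COUNT FROM t", 0, 5, True)` returns the value `"COUNT"` with no trailing space.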

read_mysql_comment() read at most five version digits, so a six-digit MMmmrr version comment — added in MySQL 8.4 — was misparsed: /*!100000 ... */ gated as version 10000 instead of 100000, and the sixth digit of /*!080400 ... */ leaked into the comment body as SQL. Mirror MySQL's own lexer rule (sql/sql_lex.cc): the first five characters must be digits; a sixth digit immediately followed by whitespace extends the version to six digits; otherwise the version stays five digits and any extra is content.
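The digit rule described above can be sketched in Python as follows. The `read_version` helper is hypothetical and takes the text that follows `/*!`; returning `None` when fewer than five digits are present is an assumption of the sketch, since the commit message does not cover that case.

```python
def _is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def read_version(body: str):
    """Split the text after "/*!" into (version or None, remaining text)."""
    head = body[:5]
    if len(head) < 5 or not all(_is_digit(c) for c in head):
        return None, body  # assumption: no version gate in this case
    # A sixth digit counts only when whitespace follows it immediately.
    six = len(body) > 6 and _is_digit(body[5]) and body[6].isspace()
    end = 6 if six else 5
    return int(body[:end]), body[end:]
```

Under this rule, `100000 SELECT 1` gates on version 100000, while in `5000123 x` the version stays 50001 and `23 x` is content.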

Left-recursive grammar list rules nest through their own rule name ("list: list ',' item | item"). The new accessor collects child nodes of the whole nested chain in source order, as if the list were flat, which is how AST consumers want to iterate list items.
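To make the flattening concrete, here is a hedged Python sketch. The `Node` class is a hypothetical stand-in for WP_Parser_Node, and skipping plain tokens such as the comma is an assumption based on the accessor's name. The walk is iterative so that long lists do not deepen the call stack.

```python
class Node:
    def __init__(self, rule_name, children):
        self.rule_name = rule_name
        self.children = children  # Node instances or plain token strings

    def get_flattened_child_nodes(self):
        """Collect child nodes of a left-recursive chain in source order."""
        # Follow the chain through the first child while it repeats this rule.
        chain = [self]
        while True:
            first = chain[-1].children[0] if chain[-1].children else None
            if isinstance(first, Node) and first.rule_name == self.rule_name:
                chain.append(first)
            else:
                break
        items = []
        # The innermost node holds the first item, so walk the chain inside out.
        for depth, node in enumerate(reversed(chain)):
            own = node.children if depth == 0 else node.children[1:]
            items.extend(c for c in own if isinstance(c, Node))
        return items
```

For a tree shaped like `list(list(list(a), ',', b), ',', c)`, the accessor returns the item nodes for a, b, and c in that order.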

With the ANSI_QUOTES SQL mode, MySQL treats double-quoted text as a quoted identifier instead of a string literal. Emit an identifier token for it, so identifier positions accept double-quoted names.

Replace the hand-written recursive parser with the table-driven LALR(1) parser generated from MySQL's official grammar, consumed as a Composer dependency:

- Require wordpress/mysql-parser, resolved from the monorepo sibling package via a Composer path repository, and load it through the Composer autoloader in the driver loader.
- Drop the old parser machinery (WP_Parser, WP_Parser_Grammar, the lexer, the parse tree classes, and mysql-grammar.php), all provided by the parser package now, and the native parser fork, which is bound to the old grammar contract.
- Parse multi-statement input by splitting the token stream on top-level ';' separators, as the grammar parses a single statement (this is how MySQL clients split multi-statement input).
- Re-key the statement dispatch to the sql_yacc.yy rule names and map keyword token constants to the grammar keyword table.

The translation layer still needs to be ported to the new AST shapes.
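Splitting on separator tokens rather than on raw text might look like the Python sketch below. What the driver counts as "top-level" is not spelled out here, so tracking parenthesis depth is an assumption, as are the plain-string tokens and the choice to drop empty statements.

```python
def split_statements(tokens):
    """Split a token list into statements at ';' tokens outside parentheses."""
    statements, current, depth = [], [], 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth = max(0, depth - 1)
        if token == ";" and depth == 0:
            if current:  # assumption: empty statements are skipped
                statements.append(current)
            current = []
        else:
            current.append(token)
    if current:
        statements.append(current)
    return statements
```

Because the input is already tokenized, a semicolon inside a string literal such as `'a;b'` is part of a single string token and never acts as a separator.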

Re-key the SQL-to-SQLite translation from the old hand-written grammar to the sql_yacc.yy rule names and tree shapes:

- Rewrite the translate() special cases and per-statement handlers (SELECT, INSERT/REPLACE, UPDATE, DELETE, DDL, SHOW, SET, USE, transactions and locking, administration statements).
- Iterate grammar lists with the flattened child node accessor, as lists are left-recursive in the new grammar.
- Walk JOINs recursively when building the table reference map, as joins nest through the left operand in the new grammar.
- Retry parsing with the ANSI_QUOTES SQL mode when a query fails to parse. MySQL rejects double-quoted identifiers without ANSI_QUOTES, but WordPress relies on them (dbDelta can produce double-quoted index names) and the previous parser accepted them.
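The retry can be sketched in Python as a thin wrapper. Everything named here is hypothetical: the `ParseError` exception, the `parse` callable with its `ansi_quotes` flag, and the toy parser used for the demonstration. How the real driver signals a failed parse is not stated in this text.

```python
class ParseError(Exception):
    """Raised by the hypothetical parser when a query does not parse."""


def parse_with_ansi_quotes_retry(parse, sql):
    """Try the default mode first; retry with ANSI_QUOTES only on failure."""
    try:
        return parse(sql, ansi_quotes=False)
    except ParseError:
        return parse(sql, ansi_quotes=True)


def toy_parse(sql, ansi_quotes):
    """Toy stand-in: rejects a double-quoted name after KEY in default mode."""
    tokens = sql.split()
    for prev, token in zip(tokens, tokens[1:]):
        if prev.upper() == "KEY" and token.startswith('"') and not ansi_quotes:
            raise ParseError(sql)
    return ("ansi" if ansi_quotes else "default", tokens)
```

Queries that parse in the default mode never reach the second attempt, so double-quoted string literals in ordinary queries keep their string meaning.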

Re-key CREATE TABLE, ALTER TABLE, and index statement analysis to the sql_yacc.yy rule names and tree shapes. The recorded information schema rows are unchanged: a battery of DDL statements covering all supported data types, constraints, indexes, and table options produces the exact same rows as the previous parser and builder. Multi-column ADD COLUMN (a INT, b INT) is now recorded correctly; the previous builder crashed on it.

The lexer, parser, token data, and parse tree classes are tested in the wordpress/mysql-parser package now:

- Remove the lexer and parser test suites from the driver package (the corpus data stays here; the parser package corpus test reads it from the sibling package and skips when it is not available).
- Move the parse tree node tests to the parser package and cover the new flattened child node accessor.
- Remove the native parser extension tests and tools, which are bound to the old grammar contract.
- Update the AST dump and benchmark tools to the new parser API.

The SQLite driver now loads the MySQL parser as a Composer dependency, and the native parser extension bound to the old grammar is gone:

- Install the driver Composer dependencies in the WordPress test setup and mount the package vendor directory and the parser package into the WordPress containers.
- Bundle the driver's production Composer dependencies into the plugin zip, resolving the path-repository symlink into a real copy of the parser package.
- Run the driver test workflow against changes to the parser package and drop the native parser extension jobs and setup scripts.
- Install the driver Composer dependencies in the lexer benchmark workflow.

> [!NOTE]
> The changed line counts are misleading: about 115,000 of the added lines are just a testing query corpus.
> (Copied to the new `mysql-parser` package from `mysql-on-sqlite`.)

## LALR(1) parser from official MySQL grammar

A new experimental **`packages/mysql-parser` package** that implements a universal **LALR(1) parser** and builds a MySQL parse table from the **official MySQL grammar**.

This is the initial implementation, not used anywhere in the driver yet. A full driver migration to this new parser is AI-prototyped in #432.

### What it does

- **Grammar processing pipeline:** Fetch sources → Bison → generate parse table and token data.
- **Lexer:** The existing MySQL lexer was copied and adapted to the new LALR(1) grammar.
- **Parser:** A new universal LALR(1) parser implementation.
- **MySQL grammar:** A compacted MySQL 8.4 LTS grammar, extracted using the grammar processing pipeline.
- **MySQL query corpus:** The ~70k MySQL query corpus was copied and updated to MySQL 8.4 LTS.
- **Benchmark:** A no-JIT/JIT lexer + parser benchmark.
- **Test suite:** New tests and a CI job.

### What it doesn't do yet

- **Replace the current parser:** It's a standalone package that doesn't replace the existing parser yet.
- **Multi-version:** For now, the parser only tracks MySQL 8.4 LTS. Multi-version will be done as a follow-up.

### Benchmarks

Measured on a MacBook Pro M4 Max with PHP 8.4, over the package's 8.4.10 corpus (~70k queries), end-to-end (lex + parse), best of 5 timed passes after 2 warmups:

| Metric | LL (trunk) | LALR (this) |
| --- | --- | --- |
| Throughput, no JIT | 11,010 QPS | **59,457 QPS** |
| Throughput, warm JIT | 24,393 QPS | **112,759 QPS** |
| Cold boot, no opcache | **~1.9 ms** | ~2.7 ms |
| Warm boot, opcache | ~0.6 ms | **~0.3 ms** |
| Memory, no opcache | **~3.4 MB** | ~5.4 MB |
| Memory, opcache worker | **~1.8 MB** | ~3.1 MB |
| Generated parser/table file size | **65 KB** | 177 KB |
| Full size (lexer + parser + grammar) | **246 KB** | 260 KB |

This parser is over **5× faster** without JIT and over **4.5× faster** with JIT. Cold boot is a bit slower; warm boot is faster. The memory footprint is a bit higher, and the overall size is about 14 KB higher.

#### Recognize-only

The same lex + parse runs, but building **no AST**, measuring only recognition without AST allocation:

| Throughput | LL (trunk) | LALR (this) |
| --- | --- | --- |
| no JIT | 16,359 QPS | **95,374 QPS** |
| warm JIT | 49,940 QPS | **210,032 QPS** |

Dropping AST construction lifts both by ~1.5–2×, but the gap stays around **~4.2–5.8×**.

JanJakes force-pushed the lalr-parser-driver branch 4 times, most recently from 6ede829 to bee436b · June 12, 2026 09:01

JanJakes force-pushed the lalr-parser branch from 14dfcab to 3e18388 · June 12, 2026 09:49

JanJakes force-pushed the lalr-parser-driver branch from bee436b to 076e3db · June 12, 2026 09:50

JanJakes force-pushed the lalr-parser branch from 3e18388 to 54e0311 · June 12, 2026 14:37

JanJakes force-pushed the lalr-parser-driver branch from 076e3db to 40a90cf · June 12, 2026 14:39

JanJakes force-pushed the lalr-parser branch from 54e0311 to b3c39da · June 12, 2026 14:55

JanJakes force-pushed the lalr-parser-driver branch from 40a90cf to e75997a · June 12, 2026 14:56

JanJakes force-pushed the lalr-parser branch from b3c39da to 1f88932 · June 12, 2026 15:27

JanJakes force-pushed the lalr-parser-driver branch from e75997a to 982dd0f · June 12, 2026 15:28

JanJakes added 2 commits June 12, 2026 21:05

Scaffold the mysql-parser package

2cfc1d3

JanJakes force-pushed the lalr-parser branch from 1f88932 to a90ea51 · June 12, 2026 19:08

JanJakes force-pushed the lalr-parser-driver branch from 982dd0f to 7730323 · June 12, 2026 19:09

JanJakes force-pushed the lalr-parser branch from a90ea51 to b8fb251 · June 12, 2026 19:19

JanJakes force-pushed the lalr-parser-driver branch from 7730323 to 9504ce3 · June 12, 2026 19:19

JanJakes force-pushed the lalr-parser branch from b8fb251 to f3f0935 · June 12, 2026 20:34

JanJakes force-pushed the lalr-parser-driver branch from 9504ce3 to d5ddec4 · June 12, 2026 20:36

JanJakes force-pushed the lalr-parser branch from f3f0935 to 24d4f21 · June 13, 2026 13:07

JanJakes force-pushed the lalr-parser-driver branch from d5ddec4 to b59b50b · June 13, 2026 13:08

JanJakes force-pushed the lalr-parser branch from 24d4f21 to 296a9c5 · June 13, 2026 13:17

JanJakes force-pushed the lalr-parser-driver branch from b59b50b to 9aa6c4e · June 13, 2026 13:19

JanJakes force-pushed the lalr-parser branch from 296a9c5 to 0f841c5 · June 13, 2026 14:10

JanJakes force-pushed the lalr-parser-driver branch from 9aa6c4e to 00e9a3a · June 13, 2026 14:10

JanJakes force-pushed the lalr-parser branch 2 times, most recently from a9f8619 to aca2c38 · June 19, 2026 09:22

JanJakes force-pushed the lalr-parser-driver branch from 00e9a3a to 187593a · June 19, 2026 09:35

JanJakes mentioned this pull request Jun 19, 2026

LALR(1) parser from official MySQL grammar #429

Merged

JanJakes added 11 commits June 19, 2026 15:36

Add the corpus benchmark

85c4b9a

Add README.md

e347cb2

JanJakes force-pushed the lalr-parser branch from aca2c38 to 91f53a1 · June 19, 2026 14:04

JanJakes added 10 commits June 19, 2026 16:28

Support the ANSI_QUOTES SQL mode in the lexer

33cbf06

JanJakes force-pushed the lalr-parser-driver branch from 187593a to 816ee66 · June 19, 2026 14:37

JanJakes force-pushed the lalr-parser-driver branch from 816ee66 to 47a1035 · June 19, 2026 14:54

Base automatically changed from lalr-parser to trunk June 23, 2026 22:50

JanJakes wants to merge 25 commits into trunk from lalr-parser-driver
