
Experiment: Port MySQL-on-SQLite to LALR(1) parser #432

Draft
JanJakes wants to merge 25 commits into trunk from lalr-parser-driver

@JanJakes JanJakes commented Jun 12, 2026


Stacked on #429. An end-to-end experiment that ports the MySQL-on-SQLite driver from its hand-written recursive parser to the new LALR(1) parser, consumed as a real Composer dependency.

Not meant to merge as-is — it depends on #429 landing first. All tests pass (543 in mysql-on-sqlite, 131 in mysql-parser, including the MySQL server corpus pin), and the diff is net-negative: it deletes the driver's old parser machinery.

What it does

  • Package reuse via Composer. wordpress/mysql-parser gets a classmap autoloader (WordPress file naming rules out PSR-4) and exposes WP_MySQL_Parser::PARSE_TABLE_PATH for the generated parse table. The driver requires it through a Composer path repository — a vendor symlink, so there's one source of truth and nothing is duplicated. The driver's old parser machinery (grammar, lexer, parse-tree classes, and the native Rust parser fork) is removed entirely.
  • Driver port. Statement dispatch, the query translation layer, and the information-schema builder are re-keyed to the official sql_yacc.yy rule names and tree shapes. Multi-statement input is split on top-level ; (the grammar parses one statement, the way MySQL clients do); create_parser()/next_query() becomes parse_mysql_query(). The info-schema builder is verified byte-exact against the old builder across a DDL battery (data types, constraints, indexes, table options); multi-column ADD COLUMN (a INT, b INT), which crashed the old builder, is now recorded correctly.
  • Parser refinements the port surfaced.
    • Empty reductions (opt_*, Bison mid-rule $@N) produce no AST nodes, so consumers see an optional clause only when it's present.
    • WP_Parser_Node::get_flattened_child_nodes() iterates left-recursive grammar lists (list: list ',' item) as if flat.
    • ANSI_QUOTES lexer mode plus a driver-side parse retry, for double-quoted identifiers that WordPress emits (e.g. dbDelta) but MySQL rejects without the mode.
  • Deployment & CI. Docker environments install the driver's Composer deps and mount the package; the plugin-zip build resolves the path symlink into a pruned copy; the driver workflow also triggers on parser-package changes; the native parser extension jobs are removed (packages/php-ext-wp-mysql-parser is orphaned by this branch).

What it doesn't do yet

Testing

cd packages/mysql-on-sqlite && composer install && composer run test
cd packages/mysql-parser && composer install && composer run test
composer run build-sqlite-plugin-zip


github-actions Bot commented Jun 12, 2026


🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

| Config | Base (QPS) | This PR (QPS) | Speedup |
| --- | --- | --- | --- |
| no JIT | 70,879 | 72,307 | 1.02× |
| tracing JIT | 160,210 | 189,967 | 1.19× |

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

@JanJakes JanJakes force-pushed the lalr-parser-driver branch 4 times, most recently from 6ede829 to bee436b Compare June 12, 2026 09:01
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from bee436b to 076e3db Compare June 12, 2026 09:50
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 076e3db to 40a90cf Compare June 12, 2026 14:39
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 40a90cf to e75997a Compare June 12, 2026 14:56
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from e75997a to 982dd0f Compare June 12, 2026 15:28
JanJakes added 2 commits June 12, 2026 21:05
Add a new monorepo package for a MySQL parser generated from the official
MySQL grammar. This commit sets up the package metadata; the source, tooling,
and documentation follow in later commits.
Bring the MySQL lexer and the token and node classes over from the
mysql-on-sqlite package unchanged, so the later adaptation to the official
grammar is reviewable as a focused diff, and register src/ as the package
Composer classmap (the WordPress-style file names rule out PSR-4).
Compile the grammar from the official MySQL sources: fetch sql_yacc.yy and
lex.h at a pinned, checksum-verified mysql-server tag; run a pinned Bison
build (Docker, version-asserted) to produce the automaton; compact the
automaton into plain PHP ACTION/GOTO tables (about 7% of the dense cells);
and derive the keyword table and token constants from lex.h, failing the
build on any unresolved terminal. bin/build-grammar (composer run
build-grammar) runs the pipeline end to end.
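
To illustrate the compaction step: an LALR(1) ACTION table is mostly empty (error) cells, so keeping only the populated cells of each state is far smaller than the dense matrix. The following is a minimal Python sketch of that idea, not the package's actual PHP table format; the `compact` and `lookup` names and the toy table are hypothetical.

```python
# Toy dense ACTION table: rows are states, columns are terminals.
# 0 means "error" (an empty cell); any other integer is an encoded action.
DENSE = [
    [0, 5, 0, 0, 0, 0],
    [0, 0, 0, -2, 0, 0],
    [7, 0, 0, 0, 0, 0],
]


def compact(dense):
    """Keep only the populated cells of each row as {terminal: action}."""
    return [{t: a for t, a in enumerate(row) if a != 0} for row in dense]


def lookup(sparse, state, terminal):
    """Return the action for (state, terminal), or 0 for an error cell."""
    return sparse[state].get(terminal, 0)


SPARSE = compact(DENSE)
stored_cells = sum(len(row) for row in SPARSE)   # populated cells kept
dense_cells = sum(len(row) for row in DENSE)     # cells in the full matrix
```

Lookups stay constant-time because each state row is a hash map; a missing key is treated as a syntax error.
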
@JanJakes JanJakes force-pushed the lalr-parser branch 2 times, most recently from a9f8619 to aca2c38 Compare June 19, 2026 09:22
JanJakes added 11 commits June 19, 2026 15:36
Commit the LALR(1) parse table produced by bin/build-grammar: a plain PHP
array that compacts the grammar's dense ACTION/GOTO automaton to about 7% of
its cells. Regenerate with composer run build-grammar.

The token-level data (keyword table, paren-gated function keywords, and token
constants) is generated into the lexer itself; see the next commit.
Make the lexer emit the grammar's own token numbers, with the keyword table
generated from lex.h: keyword synonyms, paren-gated function keywords, and
dropped keywords all follow MySQL's own data. Diagnostic token names are
derived on demand instead of shipping a name map.

The lexer produces MySQL's grammar token stream directly, the way MySQL's own
lexer does, rather than scanning a different token model and reconciling it in
a separate pass: "@" is a standalone terminal followed by its name, "WITH
ROLLUP" is contracted via a one-token lookahead, NOT becomes NOT2 under
HIGH_NOT_PRECEDENCE, and the input ends with END_OF_INPUT and Bison's end
marker (omitted on invalid input). The pull iterator (next_token/get_token)
and remaining_tokens() both yield this single stream; the scanner's internal
sentinels stay private and never reach it.
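
The "WITH ROLLUP" contraction can be sketched as a one-token lookahead over already-scanned words. This is a simplified Python illustration, not the PHP lexer: the `grammar_tokens` function and its word-list input are hypothetical, and the `WITH_ROLLUP_SYM` name is assumed from MySQL's grammar rather than taken from this commit.

```python
def grammar_tokens(words):
    """Yield grammar tokens from scanned words, merging WITH + ROLLUP."""
    i = 0
    while i < len(words):
        # Peek one token ahead without consuming it.
        following = words[i + 1].upper() if i + 1 < len(words) else None
        if words[i].upper() == "WITH" and following == "ROLLUP":
            yield "WITH_ROLLUP_SYM"  # two scanned words, one grammar token
            i += 2
        else:
            yield words[i]
            i += 1
    yield "END_OF_INPUT"
```

For example, `list(grammar_tokens(["GROUP", "BY", "a", "WITH", "ROLLUP"]))` ends with the single contracted token followed by `END_OF_INPUT`, while a `WITH` that starts a common table expression passes through unchanged.
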
Add a table-driven LALR(1) shift-reduce runtime (WP_Parser) that runs over a
WP_Parser_Grammar, which expands a compact, generated ACTION/GOTO parse table,
and builds a WP_Parser_Node AST. The grammar is unambiguous for LALR(1), so the loop is
deterministic, with no conflict handling or backtracking. A rule that matches
nothing produces no node, so empty optional rules are absent from the tree.

This is grammar-agnostic: it knows nothing about MySQL, only how to run an
LALR(1) parse table.
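
A table-driven shift-reduce loop of this kind can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the WP_Parser implementation: the two-rule list grammar, the hand-built ACTION/GOTO tables, and the tuple-based tree are all hypothetical, and the toy grammar has no empty rules.

```python
# Toy grammar (not MySQL):
#   rule 1: list -> list ',' ID
#   rule 2: list -> ID
RULES = {1: ("list", 3), 2: ("list", 1)}  # rule -> (lhs, rhs length)

# ACTION[state][terminal] = ("shift", state) | ("reduce", rule) | ("accept",)
ACTION = {
    0: {"ID": ("shift", 1)},
    1: {",": ("reduce", 2), "$": ("reduce", 2)},
    2: {",": ("shift", 3), "$": ("accept",)},
    3: {"ID": ("shift", 4)},
    4: {",": ("reduce", 1), "$": ("reduce", 1)},
}
GOTO = {0: {"list": 2}}


def parse(tokens):
    """Run the tables over (kind, value) tokens; return a tree or None."""
    states, nodes = [0], []
    stream = list(tokens) + [("$", None)]  # "$" is the end marker
    pos = 0
    while True:
        kind, value = stream[pos]
        action = ACTION[states[-1]].get(kind)
        if action is None:
            return None  # syntax error: no cell for (state, terminal)
        if action[0] == "shift":
            states.append(action[1])
            nodes.append(value)
            pos += 1
        elif action[0] == "reduce":
            lhs, length = RULES[action[1]]
            children = nodes[len(nodes) - length:]
            del nodes[len(nodes) - length:]
            del states[len(states) - length:]
            states.append(GOTO[states[-1]][lhs])
            nodes.append((lhs, children))
        else:  # accept
            return nodes[0]
```

Each step is a single table lookup, so there is no backtracking. Note that the left-recursive rule produces a nested tree: `a, b` parses to `("list", [("list", ["a"]), ",", "b"])`.
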

Adapt the copied parse-tree primitives to the package: the runtime builds each
node in a single step, so the old recursive parser's merge_fragment() is
dropped, and the node and token docblocks no longer reference that parser.
Wire the generated MySQL parse table into the generic LALR(1) runtime through a
factory: WP_MySQL_Parser_Factory::create_parser() builds a WP_Parser over a
WP_Parser_Grammar loaded from src/mysql-parse-table.php. The grammar is
expanded once and shared between created parsers; create_grammar() exposes a
fresh grammar for callers that want their own.

This is the only piece that knows the parser is being used for MySQL.
Cut the generated parse table from 190 KB to 177 KB (-7%) with no behavior
change: most shifts on a given terminal go to the same successor state, so
those cells are stored as bare token lists (action_row_shift_tokens) and
restored from a per-terminal target table (action_shift_targets) when the
grammar is constructed. The smaller file also parses faster on a cold opcache.
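
The shift-target trick can be sketched as follows. This is a hedged Python illustration of the idea only: the `compress` and `expand` functions, the separate `exceptions` map, and the toy data are hypothetical, and the real table layout (including how non-default shifts are stored) may differ.

```python
from collections import Counter


def compress(shifts):
    """Split shift cells into per-terminal default targets plus exceptions.

    `shifts` maps state -> {terminal: target_state} for shift cells only.
    """
    votes = {}
    for row in shifts.values():
        for terminal, target in row.items():
            votes.setdefault(terminal, Counter())[target] += 1
    # The most common successor state for each terminal becomes its default.
    shift_targets = {t: c.most_common(1)[0][0] for t, c in votes.items()}

    row_shift_tokens, exceptions = {}, {}
    for state, row in shifts.items():
        for terminal, target in row.items():
            if target == shift_targets[terminal]:
                # Store the bare token; the target is implied.
                row_shift_tokens.setdefault(state, []).append(terminal)
            else:
                exceptions.setdefault(state, {})[terminal] = target
    return shift_targets, row_shift_tokens, exceptions


def expand(shift_targets, row_shift_tokens, exceptions):
    """Rebuild the full state -> {terminal: target} map."""
    shifts = {}
    for state, terminals in row_shift_tokens.items():
        shifts[state] = {t: shift_targets[t] for t in terminals}
    for state, row in exceptions.items():
        shifts.setdefault(state, {}).update(row)
    return shifts


TOY = {0: {"(": 5, "ID": 6}, 1: {"(": 5, "ID": 6}, 2: {"(": 5, "ID": 9}}
```

The saving comes from replacing most `terminal => target` pairs with a bare terminal; only the rare shifts that disagree with the default keep an explicit target.
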
Bring the query corpus extracted from the MySQL server test suite, with the
tooling that generates it, into the package: data/mysql-server-query-corpus/
plus a bin/build-corpus orchestrator (composer run build-corpus) that
fetches the mysql-test directory at the pinned tag and extracts the queries.
The SQLite driver package keeps its own copy for now; it will be retired
when the driver is ported to this package.
Measure the corpus parse rate and end-to-end (lex + parse) throughput, with
warmup and timed passes. The parser accepts 99.76% of the ~69k corpus
queries.
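
A benchmark of this shape (warmup passes, then timed passes) might look like the Python sketch below. It is not the package's benchmark tool: the `best_qps` function, its default pass counts, and the callable-based interface are assumptions made for illustration.

```python
import time


def best_qps(parse, queries, warmups=2, passes=5):
    """Return queries per second for the fastest of several timed passes."""
    for _ in range(warmups):
        for query in queries:
            parse(query)  # untimed: warms caches (and a JIT, where present)
    best = float("inf")
    for _ in range(passes):
        start = time.perf_counter()
        for query in queries:
            parse(query)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very small inputs.
    return len(queries) / max(best, 1e-9)
```

Taking the best pass rather than the mean reduces the influence of scheduling noise on shared machines.
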
Cover the token stream, the scanner (the exhaustive unit suite ported from
the SQLite driver), the parser runtime, token value and name resolution,
generated grammar-data invariants, and a corpus regression test pinning the
exact acceptance tally. Run the suite on the oldest and newest supported PHP
versions in CI.
Tokenizing a whole statement routed every token through the pull iterator
(next_token -> produce -> scan_lexeme -> read_next_token -> enqueue_token),
adding ~4 method calls plus token-queue bookkeeping per token over a plain
scan-and-emit loop. Give remaining_tokens() a tight fast path that emits the
common single-token lexemes inline and delegates only the rare multi-token
ones (@, WITH ROLLUP, end markers) to the buffered producers. The pull API is
unchanged and the output is byte-identical; ~24% faster (no JIT) / ~16% (JIT)
end-to-end over the MySQL server corpus.
The pull iterator buffered produced tokens in a dynamic $token_queue drained
by index. A scan step yields at most two grammar tokens, so a single
$pending_token slot suffices: next_token() returns the first and holds the
second. The multi-token producers (@, WITH ROLLUP, end markers) now append to
a caller-supplied array, shared directly by both next_token() and
remaining_tokens() — removing the queue bookkeeping and the duplicated drain
in the fast path. A make_token() helper unifies token construction.

Output is byte-identical and throughput is unchanged (the multi-token cases
were already off the hot path); this is a structural cleanup.
JanJakes added 10 commits June 19, 2026 16:28
The final backslash-stripping step used preg_replace() with the "u" (UTF-8)
modifier. That modifier makes PCRE validate the whole subject as UTF-8 and
return null on the first invalid byte; since get_value() is typed ": string",
the null turned into a fatal TypeError. MySQL string literals may legitimately
carry non-UTF-8 bytes (binary or other-charset payloads), and the lexer scans
them at the byte level, so reading the value of such a literal crashed.

Switch the modifier to "s" (DOTALL). A byte-wise strip is binary-safe, yields
identical results for valid UTF-8 (no continuation byte is a backslash), and
additionally handles a backslash preceding a newline byte.
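
The difference between a byte-wise strip and a UTF-8-validating one can be shown with a Python analogue. This is only an illustration: the regular expression is an assumed stand-in for the package's actual pattern, and `strip_backslashes` is a hypothetical name. In Python, a `bytes` pattern never validates UTF-8, which plays the role of dropping PHP's "u" modifier.

```python
import re


def strip_backslashes(raw: bytes) -> bytes:
    """Drop each backslash and keep the byte that follows it."""
    # DOTALL lets "." match a newline byte as well.
    return re.sub(rb"\\(.)", rb"\1", raw, flags=re.DOTALL)


# Invalid UTF-8 bytes (0xFF, 0xFE) pass through untouched.
binary_payload = strip_backslashes(b"\xff\\\\\xfe")

# Without DOTALL, a backslash before a newline is left in place.
without_dotall = re.sub(rb"\\(.)", rb"\1", b"x\\\ny")
```
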
WP_MySQL_Token::get_value() routed backtick-quoted identifiers through the same
backslash-unescaping path as string literals. In MySQL a backslash is never an
escape inside `...` identifiers; only a doubled backtick is. As a result an
identifier such as `a\nb` came back as "a<newline>b", silently corrupting any
table, column, or alias name that contains a backslash.

Treat backtick identifiers like the NO_BACKSLASH_ESCAPES path: collapse only the
doubled bounding backtick and keep every other byte literal.
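
The intended behavior for backtick-quoted identifiers is small enough to sketch directly. The Python function below is a hypothetical illustration of the rule described above, not the get_value() code itself.

```python
def backtick_identifier_value(token: str) -> str:
    """Return the name inside a `...` identifier token."""
    if len(token) < 2 or token[0] != "`" or token[-1] != "`":
        raise ValueError("not a backtick-quoted identifier")
    # Only a doubled backtick is an escape; backslashes stay literal.
    return token[1:-1].replace("``", "`")
```

With this rule, the identifier `` `a\nb` `` keeps its backslash and the letter "n" as two separate characters instead of turning into a newline.
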
When resolving a function keyword (SYM_FN), the lexer peeks for a following "("
and, under SQL_MODE_IGNORE_SPACE, skips intervening whitespace first. It skipped
by advancing bytes_already_read and never restored it. When no "(" followed, the
keyword was emitted as an IDENTIFIER whose length — derived from
bytes_already_read in produce() — now covered the trailing whitespace, so the
extracted value was e.g. "COUNT " instead of "COUNT". Under this ANSI-style mode
a column or table named after a function would resolve to the wrong identifier.

Peek with a local index instead of mutating bytes_already_read, so the token's
byte range ends at the keyword and the next scan consumes the whitespace.
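
The fix amounts to peeking with a local cursor. The Python sketch below illustrates the pattern under simplifying assumptions: `followed_by_open_paren`, `scan_keyword`, and the token kinds are hypothetical names, and the word scanner is far cruder than the real lexer.

```python
WHITESPACE = " \t\n\r\f\v"


def followed_by_open_paren(sql: str, pos: int, ignore_space: bool) -> bool:
    """Check for "(" at pos, optionally skipping whitespace first."""
    i = pos  # local cursor: the caller's position is never modified
    if ignore_space:
        while i < len(sql) and sql[i] in WHITESPACE:
            i += 1
    return i < len(sql) and sql[i] == "("


def scan_keyword(sql: str, start: int, ignore_space: bool):
    """Return (kind, text, end) for the word starting at `start`."""
    end = start
    while end < len(sql) and (sql[end].isalnum() or sql[end] == "_"):
        end += 1
    is_call = followed_by_open_paren(sql, end, ignore_space)
    # `end` still points just past the word, so the token text never
    # includes whitespace that was only peeked over.
    return ("FUNCTION" if is_call else "IDENTIFIER", sql[start:end], end)
```

Because the peek never moves the shared position, the token's byte range is the same whether or not a parenthesis follows.
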
read_mysql_comment() read at most five version digits, so a six-digit MMmmrr
version comment — added in MySQL 8.4 — was misparsed: /*!100000 ... */ gated as
version 10000 instead of 100000, and the sixth digit of /*!080400 ... */ leaked
into the comment body as SQL.

Mirror MySQL's own lexer rule (sql/sql_lex.cc): the first five characters must be
digits; a sixth digit immediately followed by whitespace extends the version to
six digits; otherwise the version stays five digits and any extra is content.
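
The digit rule stated above can be written out as a short function. This Python sketch follows the rule as described in this commit message; the `read_comment_version` name and its return shape are hypothetical, and it takes the text that follows `/*!` as input.

```python
def is_ascii_digits(text: str) -> bool:
    return text.isascii() and text.isdigit()


def read_comment_version(body: str):
    """Return (version or None, number of characters consumed)."""
    if len(body) < 5 or not is_ascii_digits(body[:5]):
        return None, 0  # no version: everything is comment content
    # A sixth digit counts only when whitespace follows it immediately.
    if len(body) > 6 and is_ascii_digits(body[5]) and body[6].isspace():
        return int(body[:6]), 6
    return int(body[:5]), 5
```

Under this rule `/*!100000 ... */` yields version 100000, while a run of digits with no whitespace after the sixth one stays a five-digit version and the remaining digits are treated as content.
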
Left-recursive grammar list rules nest through their own rule name
("list: list ',' item | item"). The new accessor collects child nodes
of the whole nested chain in source order, as if the list were flat,
which is how AST consumers want to iterate list items.
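
A flattening accessor of this kind can be sketched as a short recursion. The Python class below is an illustration, not WP_Parser_Node: the `Node` shape is hypothetical, and skipping separator tokens in the result is an assumption about what "child nodes" means here.

```python
class Node:
    def __init__(self, rule_name, children):
        self.rule_name = rule_name
        self.children = children  # Node instances or token strings

    def get_flattened_child_nodes(self):
        """Collect item nodes from a left-recursive chain in source order."""
        flat = []
        for child in self.children:
            if not isinstance(child, Node):
                continue  # tokens such as "," are not nodes
            if child.rule_name == self.rule_name:
                # Same rule nested in itself: descend and splice in place.
                flat.extend(child.get_flattened_child_nodes())
            else:
                flat.append(child)
        return flat


# list: list ',' item | item   applied to "a, b, c"
inner = Node("list", [Node("item", ["a"])])
middle = Node("list", [inner, ",", Node("item", ["b"])])
outer = Node("list", [middle, ",", Node("item", ["c"])])
items = [node.children[0] for node in outer.get_flattened_child_nodes()]
```

Because the nested list is always the leftmost child, descending into it first preserves source order.
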
With the ANSI_QUOTES SQL mode, MySQL treats double-quoted text as a
quoted identifier instead of a string literal. Emit an identifier token
for it, so identifier positions accept double-quoted names.
Replace the hand-written recursive parser with the table-driven LALR(1)
parser generated from MySQL's official grammar, consumed as a Composer
dependency:

- Require wordpress/mysql-parser, resolved from the monorepo sibling
  package via a Composer path repository, and load it through the
  Composer autoloader in the driver loader.
- Drop the old parser machinery (WP_Parser, WP_Parser_Grammar, the
  lexer, the parse tree classes, and mysql-grammar.php), all provided
  by the parser package now, and the native parser fork, which is bound
  to the old grammar contract.
- Parse multi-statement input by splitting the token stream on top-level
  ';' separators, as the grammar parses a single statement (this is how
  MySQL clients split multi-statement input).
- Re-key the statement dispatch to the sql_yacc.yy rule names and map
  keyword token constants to the grammar keyword table.

The translation layer still needs to be ported to the new AST shapes.
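
Splitting a token stream on separators can be sketched as below. This Python illustration makes a loud assumption: it treats "top-level" as "outside parentheses", which may not match the driver's actual rule. The `split_statements` name and the `(kind, text)` token shape are also hypothetical.

```python
def split_statements(tokens):
    """Split (kind, text) tokens into statements on top-level ';'."""
    statements, current, depth = [], [], 0
    for kind, text in tokens:
        if kind == "(":
            depth += 1
        elif kind == ")":
            depth = max(0, depth - 1)
        if kind == ";" and depth == 0:
            if current:  # skip empty statements such as ";;"
                statements.append(current)
            current = []
        else:
            current.append((kind, text))
    if current:
        statements.append(current)
    return statements
```

Working on tokens rather than raw text means a semicolon inside a string literal is never mistaken for a separator, since it arrives as part of a single string token.
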
Re-key the SQL-to-SQLite translation from the old hand-written grammar
to the sql_yacc.yy rule names and tree shapes:

- Rewrite the translate() special cases and per-statement handlers
  (SELECT, INSERT/REPLACE, UPDATE, DELETE, DDL, SHOW, SET, USE,
  transactions and locking, administration statements).
- Iterate grammar lists with the flattened child node accessor, as
  lists are left-recursive in the new grammar.
- Walk JOINs recursively when building the table reference map, as
  joins nest through the left operand in the new grammar.
- Retry parsing with the ANSI_QUOTES SQL mode when a query fails to
  parse. MySQL rejects double-quoted identifiers without ANSI_QUOTES,
  but WordPress relies on them (dbDelta can produce double-quoted index
  names) and the previous parser accepted them.
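
The ANSI_QUOTES retry can be pictured as a two-attempt wrapper. The Python below is a sketch only: `parse_with_ansi_quotes_retry` is a hypothetical name, signalling failure with `None` is an assumption about the parser's interface, and `toy_parse` is a stand-in that merely mimics the rejection.

```python
def parse_with_ansi_quotes_retry(parse, sql):
    """Parse normally; on failure, try once more with ANSI_QUOTES on."""
    ast = parse(sql, ansi_quotes=False)
    if ast is None:
        ast = parse(sql, ansi_quotes=True)
    return ast


def toy_parse(sql, ansi_quotes):
    """Stand-in parser: rejects any '"' unless ANSI_QUOTES is enabled."""
    if '"' in sql and not ansi_quotes:
        return None
    return ("ast", sql, ansi_quotes)
```

Retrying only after a failure keeps the default interpretation (double quotes as string literals) for every query that already parses.
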
Re-key CREATE TABLE, ALTER TABLE, and index statement analysis to the
sql_yacc.yy rule names and tree shapes. The recorded information schema
rows are unchanged: a battery of DDL statements covering all supported
data types, constraints, indexes, and table options produces the exact
same rows as the previous parser and builder.

Multi-column ADD COLUMN (a INT, b INT) is now recorded correctly; the
previous builder crashed on it.
The lexer, parser, token data, and parse tree classes are tested in the
wordpress/mysql-parser package now:

- Remove the lexer and parser test suites from the driver package (the
  corpus data stays here; the parser package corpus test reads it from
  the sibling package and skips when it is not available).
- Move the parse tree node tests to the parser package and cover the
  new flattened child node accessor.
- Remove the native parser extension tests and tools, which are bound
  to the old grammar contract.
- Update the AST dump and benchmark tools to the new parser API.
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 187593a to 816ee66 Compare June 19, 2026 14:37
The SQLite driver now loads the MySQL parser as a Composer dependency,
and the native parser extension bound to the old grammar is gone:

- Install the driver Composer dependencies in the WordPress test setup
  and mount the package vendor directory and the parser package into
  the WordPress containers.
- Bundle the driver's production Composer dependencies into the plugin
  zip, resolving the path-repository symlink into a real copy of the
  parser package.
- Run the driver test workflow against changes to the parser package
  and drop the native parser extension jobs and setup scripts.
- Install the driver Composer dependencies in the lexer benchmark
  workflow.
@JanJakes JanJakes force-pushed the lalr-parser-driver branch from 816ee66 to 47a1035 Compare June 19, 2026 14:54
adamziel pushed a commit that referenced this pull request Jun 23, 2026
> [!NOTE]
> The changed line counts are misleading: about 115,000 of the added lines are
> just a testing query corpus.
> (Copied to the new `mysql-parser` package from `mysql-on-sqlite`.)

## LALR(1) parser from official MySQL grammar

A new experimental **`packages/mysql-parser` package** that implements a
universal **LALR(1) parser** and builds a MySQL parse table from the
**official MySQL grammar**.

This is the initial implementation, not used anywhere in the driver yet.

A full driver migration to this new parser is AI-prototyped in
#432.

### What it does

- **Grammar processing pipeline:** Fetch sources → Bison → generate
parse table and token data.
- **Lexer:** The existing MySQL lexer was copied and adapted to the new
LALR(1) grammar.
- **Parser:** A new universal LALR(1) parser implementation.
- **MySQL grammar:** A compacted MySQL 8.4 LTS grammar, extracted using
the grammar processing pipeline.
- **MySQL query corpus:** The ~70k MySQL query corpus was copied and
updated to MySQL 8.4 LTS.
- **Benchmark:** A no-JIT/JIT lexer + parser benchmark.
- **Test suite:** New tests and a CI job.

### What it doesn't do yet

- **Replace the current parser:** It's a standalone package that doesn't
replace the existing parser yet.
- **Multi-version:** For now, the parser only tracks MySQL 8.4 LTS.
Multi-version will be done as a follow-up.

### Benchmarks

Measured on a MacBook Pro M4 Max with PHP 8.4, over the package's 8.4.10
corpus (~70k queries), end-to-end (lex + parse), best of 5 timed passes
after 2 warmups:

| Metric | LL (trunk) | LALR (this) |
| --- | --- | --- |
| Throughput, no JIT | 11,010 QPS | **59,457 QPS** |
| Throughput, warm JIT | 24,393 QPS | **112,759 QPS** |
| Cold boot, no opcache | **~1.9 ms** | ~2.7 ms |
| Warm boot, opcache | ~0.6 ms | **~0.3 ms** |
| Memory, no opcache | **~3.4 MB** | ~5.4 MB |
| Memory, opcache worker | **~1.8 MB** | ~3.1 MB |
| Generated parser/table file size | **65 KB** | 177 KB |
| Full size (lexer + parser + grammar) | **246 KB** | 260 KB |

This parser is over **5× faster** without JIT and over **4.5× faster**
with JIT. Cold boot is a bit slower; warm boot is faster. The memory
footprint is a bit higher, and the overall size about 14 KB higher.

#### Recognize-only

The same lex + parse runs, but building **no AST**, so they measure only
recognition without AST allocation:

| Throughput | LL (trunk) | LALR (this) |
| --- | --- | --- |
| no JIT | 16,359 QPS | **95,374 QPS** |
| warm JIT | 49,940 QPS | **210,032 QPS** |

Dropping AST construction lifts both by ~1.5–2×, but the gap stays at
about **4.2–5.8×**.
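
The ratios quoted in this section follow directly from the two tables. As a quick check, this Python snippet recomputes them from the QPS figures above; the variable names are only for illustration.

```python
# QPS figures copied from the tables above.
ll = {"full": {"no JIT": 11_010, "warm JIT": 24_393},
      "recognize": {"no JIT": 16_359, "warm JIT": 49_940}}
lalr = {"full": {"no JIT": 59_457, "warm JIT": 112_759},
        "recognize": {"no JIT": 95_374, "warm JIT": 210_032}}

# LALR speedup over LL, per mode and JIT setting.
speedup = {mode: {jit: lalr[mode][jit] / ll[mode][jit] for jit in ll[mode]}
           for mode in ll}

# Gain from skipping AST construction, per parser and JIT setting.
ast_lift = {name: {jit: p["recognize"][jit] / p["full"][jit] for jit in p["full"]}
            for name, p in (("LL", ll), ("LALR", lalr))}
```

The full-parse speedups come out at about 5.40× (no JIT) and 4.62× (warm JIT), the recognize-only speedups at about 5.83× and 4.21×, and the AST lift ranges from about 1.49× to 2.05×.
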
Base automatically changed from lalr-parser to trunk June 23, 2026 22:50