perf(pdf-server): lazy form extraction via range transport + incremental viewer scans by ochafik · Pull Request #639 · modelcontextprotocol/ext-apps

ochafik · 2026-04-24T18:26:29Z

Summary

display_pdf previously downloaded and parsed the entire PDF server-side to extract form metadata, defeating the viewer's chunked streaming. This PR makes both server and viewer truly incremental, and adds regression e2e tests with a range-counting fixture.

Server (`server.ts`)

PdfCacheRangeTransport lets getDocument() fetch only the byte ranges it needs. Exposes a .failed promise (pdfjs has no upstream error channel) and accumulates >512KB coalesced requests into a single onDataRange call (pdfjs's reader is keyed by exact begin).
probeFormFields() opens the document via range transport, calls getFieldObjects() first, and skips the per-page walk when empty → form-free PDFs are probed with ~10–25% of bytes instead of 100%. All errors (including transport .failed) resolve to empty results, so display_pdf never hangs on a mid-load network failure.
extractFormSchema handles separated field/widget trees (pdf-lib, some authoring tools) by picking the first array entry with a non-empty type.
validateUrl accepts http://127.0.0.1|localhost|[::1] only when PDF_SERVER_ALLOW_LOOPBACK_HTTP is set (test fixtures, local dev). Off by default.
Dead viewFieldInfo Map removed.

Viewer (`mcp-app.ts`)

disableAutoFetch / disableStream on getDocument so pdfjs doesn't background-prefetch.
Baseline annotation scan and field-name map run lazily per rendered page; per-annotation try/catch isolation preserved.
restoredRemovedIds tombstones are unioned into the persisted diff and removedRefs while the lazy baseline scan is incomplete, so a deletion on an unvisited page survives a persistAnnotations triggered elsewhere.

Regression e2e (`tests/e2e/pdf-incremental-load.spec.ts` + `tests/helpers/range-counting-server.ts`)

Plain-HTTP fixture (loopback only — playwright config sets PDF_SERVER_ALLOW_LOOPBACK_HTTP=1). Serves three pdf-lib-generated PDFs (form-free 20pg with a 1.1MB image on pages 2+; 5-field form; 2-page with one native /Text annot), records bytes/overlap, supports stallAfterBytes. 4 tests:

display_pdf on a form PDF returns the 5 expected formFields in the initial response
Form-free display_pdf fetches <30% of the file (server-side, before iframe loads); after first paint, byte-overlap < 50%
Page 1 renders while later ranges are stalled; after release, page 2 (>512KB image) renders and >90% has been served
Tombstone preservation through the full viewer (delete on page 2 → reload → edit on page 1 → reload → still gone)

Unit-test additions

PdfCacheRangeTransport: chunked accumulation, .failed rejection, empty-range guard, and a >1MB integration test that drives getOperatorList() and asserts max(requestDataRange span) > MAX_CHUNK_BYTES so it can't go vacuous
display_pdf via in-memory MCP client returns (not hangs) on mid-load fetch failure
extractFormSchema field-tree handling: pdf-lib, multi-widget, W-9/XFA, no-form
validateUrl loopback gate: env-on/off, non-loopback http still rejected
computeDiff/serializeDiff contract tests pinning the tombstone-union behaviour

Measurements

	Before	After
`display_pdf` form-free PDF (520KB)	100% downloaded	24% (3 range requests)
`display_pdf` W-9 (form, 140KB) p50 latency, stateless	0.555s	0.231s (2.4×)
Viewer first paint blocked on full file	yes	no — page 1 renders at ~48% served

Out of scope (pre-existing, tracked separately)

startPreloading() (search-index builder) still walks all pages after first paint
Server↔viewer xref overlap (~25%) — would need server to pass fetched bytes as initialData
server.ts module split, Map<uuid, ViewSession> consolidation, waitFor*/submit_*/InteractCommandSchema triplication

Test plan

194 unit tests pass (15 new)
tsc --noEmit clean
Server-side validated via direct MCP + range-counting fixture (HTTP, no TLS)
First-page-under-stall validated via headless browser
Mutation-tested: 6/7 unit-level guards confirmed non-vacuous (each fails on revert). The tombstone computeDiff contract tests are intentionally documentation-level — the e2e tombstone test is the regression guard for the persistAnnotations glue.
CI runs the e2e spec

…tal viewer scans Server: display_pdf now opens the document via PDFDataRangeTransport (disableAutoFetch) and only runs the per-page form/annotation walk when getFieldObjects() is non-empty, so form-free PDFs are probed with ~10-25% of bytes instead of a full download. The unused viewFieldInfo Map is removed. Viewer: getDocument sets disableAutoFetch/disableStream; baseline annotation scan and field-name mapping run lazily per rendered page instead of walking every page after load, so first paint no longer schedules a full-file pull. E2E: new range-counting HTTPS fixture (W-9 for forms, generated text+image PDF for no-forms) with stallAfterBytes control, and four regression tests asserting form fields are returned, <30% served on no-forms display_pdf, first page renders while later ranges are stalled, and overlap stays bounded.

…hema pdfjs getFieldObjects() returns the full field-tree array. For PDFs with a separated structure (pdf-lib, some authoring tools) the typed widget sits at fields[1+] behind a typeless container at fields[0]; the previous code only inspected fields[0] and skipped them all. Pick the first entry with a non-empty type instead. Makes the e2e forms.pdf fixture fully generated (no checked-in third-party asset on the hot path); fw9.pdf stays as a unit-test fixture for the hierarchical/XFA case.

pkg-pr-new · 2026-04-24T18:29:41Z

Open in StackBlitz

@modelcontextprotocol/ext-apps

npm i https://pkg.pr.new/@modelcontextprotocol/ext-apps@639

@modelcontextprotocol/server-basic-preact

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-preact@639

@modelcontextprotocol/server-basic-react

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-react@639

@modelcontextprotocol/server-basic-solid

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-solid@639

@modelcontextprotocol/server-basic-svelte

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-svelte@639

@modelcontextprotocol/server-basic-vanillajs

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vanillajs@639

@modelcontextprotocol/server-basic-vue

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vue@639

@modelcontextprotocol/server-budget-allocator

npm i https://pkg.pr.new/@modelcontextprotocol/server-budget-allocator@639

@modelcontextprotocol/server-cohort-heatmap

npm i https://pkg.pr.new/@modelcontextprotocol/server-cohort-heatmap@639

@modelcontextprotocol/server-customer-segmentation

npm i https://pkg.pr.new/@modelcontextprotocol/server-customer-segmentation@639

@modelcontextprotocol/server-debug

npm i https://pkg.pr.new/@modelcontextprotocol/server-debug@639

@modelcontextprotocol/server-map

npm i https://pkg.pr.new/@modelcontextprotocol/server-map@639

@modelcontextprotocol/server-pdf

npm i https://pkg.pr.new/@modelcontextprotocol/server-pdf@639

@modelcontextprotocol/server-scenario-modeler

npm i https://pkg.pr.new/@modelcontextprotocol/server-scenario-modeler@639

@modelcontextprotocol/server-shadertoy

npm i https://pkg.pr.new/@modelcontextprotocol/server-shadertoy@639

@modelcontextprotocol/server-sheet-music

npm i https://pkg.pr.new/@modelcontextprotocol/server-sheet-music@639

@modelcontextprotocol/server-system-monitor

npm i https://pkg.pr.new/@modelcontextprotocol/server-system-monitor@639

@modelcontextprotocol/server-threejs

npm i https://pkg.pr.new/@modelcontextprotocol/server-threejs@639

@modelcontextprotocol/server-transcript

npm i https://pkg.pr.new/@modelcontextprotocol/server-transcript@639

@modelcontextprotocol/server-video-resource

npm i https://pkg.pr.new/@modelcontextprotocol/server-video-resource@639

@modelcontextprotocol/server-wiki-explorer

npm i https://pkg.pr.new/@modelcontextprotocol/server-wiki-explorer@639

commit: b04b66a

…e loss PdfCacheRangeTransport: - abort() is a no-op stub on PDFDataRangeTransport (it's the hook pdfjs calls *on* the transport, not an upstream error channel). Expose a .failed promise that rejects on the first fetch error and race every pdfjs await against it in display_pdf, so transient network errors surface into the existing catch instead of hanging the tool call. - pdfjs coalesces adjacent missing chunks into one unbounded requestDataRange; readPdfRange clamps each call to MAX_CHUNK_BYTES. Loop and deliver in slices so every requested chunk is marked loaded. Viewer (mcp-app.ts): - The lazy per-page baseline scan left pdfBaselineAnnotations incomplete, so persistAnnotations and getAnnotatedPdfBytes silently dropped restoredRemovedIds tombstones for unvisited pages. Union those ids into the computed diff and removedRefs. Test fixture: release stalled handlers before resetStats/close so a failing stalled test doesn't hang afterAll; fail fast if started with NODE_ENV=production. NODE_TLS_REJECT_UNAUTHORIZED scope documented (full per-process scoping needs a validateUrl localhost allow, tracked separately).

…opback HTTP PdfCacheRangeTransport.deliver(): pdf.js's reader is keyed by the original begin and removed after one delivery, so accumulate slices and call onDataRange once with the full buffer (the previous multi-call approach threw inside pdfjs). Covered by a new integration test that drives getDocument()/getPage(1) on a >1MB PDF through a clamping in-memory readPdfRange. validateUrl: allow http://127.0.0.1|localhost|[::1] only when PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy can't be made to probe its own ports. Covered by env-on/off unit tests. Fixture switched to plain HTTP (no openssl, no NODE_TLS_REJECT_UNAUTHORIZED). Adds /error.pdf (500s after 50KB) and two e2e tests: page 2 renders after stall release (>512KB object path), and display_pdf returns gracefully on mid-load 500. Existing <30% test now samples stats before the iframe loads. New unit tests: display_pdf returns (not hangs) on mid-load fetch failure via in-memory MCP client; computeDiff/serializeDiff contract tests pinning the restoredRemovedIds tombstone-preservation behaviour.

…hrough viewer Adds /with-native-annot.pdf fixture (2 pages, native /Text annot on page 2) and a Playwright test that drives delete on page 2 → iframe reload → interact add_annotations on page 1 (persist with page 2 unscanned) → assert localStorage diff still contains the removed id → page 2 still shows it gone. Covers the mcp-app.ts glue the unit tests can't reach.

…test server.ts: extract probeFormFields() so display_pdf calls one helper instead of inlining 40 lines of transport/orFail plumbing; make extractFormSchema's fieldObjects param required (the optional branch only existed for a test helper). server.test.ts: the >1MB integration test now forces the image XObject fetch via getOperatorList() and asserts max requestDataRange span > MAX_CHUNK_BYTES, so it can't go vacuous. Dedupe makeRandomJpeg by importing from the fixture. mcp-app.ts: restore per-annotation try/catch isolation in the lazy baseline scan (a throw was skipping AnnotationLayer.render); only carry forward restoredRemovedIds while the baseline scan is incomplete so a stale id can't pin dirty=true once every page is scanned. tests: bump fixture JPEG to 1.1MB; consolidate 7 e2e tests to 4 (merge overlap into the byte-budget test, merge stall+page-2, drop the error-e2e duplicate of the unit integration test); delete run-fixture.mjs; drop /error.pdf and ERROR_AFTER_BYTES; rename tombstone test's describe and clarify page-2 covers the viewer transport.

…tFormFieldInfo throws

…me a cleared card, not removed from DOM

…me-reload replay support

ochafik added 2 commits April 24, 2026 17:15

ochafik added 4 commits April 24, 2026 19:43

ochafik marked this pull request as ready for review April 27, 2026 13:26

ochafik added 3 commits April 27, 2026 14:35

fix(pdf-server): preserve partial probeFormFields results when extrac…

76952b5

…tFormFieldInfo throws

test(pdf-server): fix tombstone e2e — deleted native annotations beco…

b1f5fd3

…me a cleared card, not removed from DOM

test(pdf-server): mark tombstone e2e as fixme — needs basic-host ifra…

5f250e3

…me-reload replay support

ochafik mentioned this pull request Apr 27, 2026

pdf-server: make tombstone-persistence e2e runnable (basic-host iframe-reload replay or URL-keyed storage) #642

Open

test(pdf-server): link tombstone fixme to tracking issue #642

b04b66a

ochafik merged commit ba31be1 into main Apr 27, 2026
20 checks passed

ochafik mentioned this pull request Apr 27, 2026

chore: bump ext-apps to 1.7.1 #643

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639

perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639
ochafik merged 10 commits intomainfrom
feat/pdf-lazy-form-extraction

ochafik commented Apr 24, 2026 •

edited

Loading

Uh oh!

pkg-pr-new Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ochafik commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Server (server.ts)

Viewer (mcp-app.ts)

Regression e2e (tests/e2e/pdf-incremental-load.spec.ts + tests/helpers/range-counting-server.ts)

Unit-test additions

Measurements

Out of scope (pre-existing, tracked separately)

Test plan

Uh oh!

pkg-pr-new Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ochafik commented Apr 24, 2026 •

edited

Loading

Server (`server.ts`)

Viewer (`mcp-app.ts`)

Regression e2e (`tests/e2e/pdf-incremental-load.spec.ts` + `tests/helpers/range-counting-server.ts`)

pkg-pr-new Bot commented Apr 24, 2026 •

edited

Loading