Skip to content

perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639

Merged
ochafik merged 10 commits intomainfrom
feat/pdf-lazy-form-extraction
Apr 27, 2026
Merged

perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639
ochafik merged 10 commits intomainfrom
feat/pdf-lazy-form-extraction

Conversation

@ochafik
Copy link
Copy Markdown
Contributor

@ochafik ochafik commented Apr 24, 2026

Summary

display_pdf previously downloaded and parsed the entire PDF server-side to extract form metadata, defeating the viewer's chunked streaming. This PR makes both server and viewer truly incremental, and adds regression e2e tests with a range-counting fixture.

Server (server.ts)

  • PdfCacheRangeTransport lets getDocument() fetch only the byte ranges it needs. Exposes a .failed promise (pdfjs has no upstream error channel) and accumulates >512KB coalesced requests into a single onDataRange call (pdfjs's reader is keyed by exact begin).
  • probeFormFields() opens the document via range transport, calls getFieldObjects() first, and skips the per-page walk when empty → form-free PDFs are probed with ~10–25% of bytes instead of 100%. All errors (including transport .failed) resolve to empty results, so display_pdf never hangs on a mid-load network failure.
  • extractFormSchema handles separated field/widget trees (pdf-lib, some authoring tools) by picking the first array entry with a non-empty type.
  • validateUrl accepts http://127.0.0.1|localhost|[::1] only when PDF_SERVER_ALLOW_LOOPBACK_HTTP is set (test fixtures, local dev). Off by default.
  • Dead viewFieldInfo Map removed.

Viewer (mcp-app.ts)

  • disableAutoFetch / disableStream on getDocument so pdfjs doesn't background-prefetch.
  • Baseline annotation scan and field-name map run lazily per rendered page; per-annotation try/catch isolation preserved.
  • restoredRemovedIds tombstones are unioned into the persisted diff and removedRefs while the lazy baseline scan is incomplete, so a deletion on an unvisited page survives a persistAnnotations triggered elsewhere.

Regression e2e (tests/e2e/pdf-incremental-load.spec.ts + tests/helpers/range-counting-server.ts)

Plain-HTTP fixture (loopback only — playwright config sets PDF_SERVER_ALLOW_LOOPBACK_HTTP=1). Serves three pdf-lib-generated PDFs (form-free 20pg with a 1.1MB image on pages 2+; 5-field form; 2-page with one native /Text annot), records bytes/overlap, supports stallAfterBytes. 4 tests:

  1. display_pdf on a form PDF returns the 5 expected formFields in the initial response
  2. Form-free display_pdf fetches <30% of the file (server-side, before iframe loads); after first paint, byte-overlap < 50%
  3. Page 1 renders while later ranges are stalled; after release, page 2 (>512KB image) renders and >90% has been served
  4. Tombstone preservation through the full viewer (delete on page 2 → reload → edit on page 1 → reload → still gone)

Unit-test additions

  • PdfCacheRangeTransport: chunked accumulation, .failed rejection, empty-range guard, and a >1MB integration test that drives getOperatorList() and asserts max(requestDataRange span) > MAX_CHUNK_BYTES so it can't go vacuous
  • display_pdf via in-memory MCP client returns (not hangs) on mid-load fetch failure
  • extractFormSchema field-tree handling: pdf-lib, multi-widget, W-9/XFA, no-form
  • validateUrl loopback gate: env-on/off, non-loopback http still rejected
  • computeDiff/serializeDiff contract tests pinning the tombstone-union behaviour

Measurements

Before After
display_pdf form-free PDF (520KB) 100% downloaded 24% (3 range requests)
display_pdf W-9 (form, 140KB) p50 latency, stateless 0.555s 0.231s (2.4×)
Viewer first paint blocked on full file yes no — page 1 renders at ~48% served

Out of scope (pre-existing, tracked separately)

  • startPreloading() (search-index builder) still walks all pages after first paint
  • Server↔viewer xref overlap (~25%) — would need server to pass fetched bytes as initialData
  • server.ts module split, Map<uuid, ViewSession> consolidation, waitFor*/submit_*/InteractCommandSchema triplication

Test plan

  • 194 unit tests pass (15 new)
  • tsc --noEmit clean
  • Server-side validated via direct MCP + range-counting fixture (HTTP, no TLS)
  • First-page-under-stall validated via headless browser
  • Mutation-tested: 6/7 unit-level guards confirmed non-vacuous (each fails on revert). The tombstone computeDiff contract tests are intentionally documentation-level — the e2e tombstone test is the regression guard for the persistAnnotations glue.
  • CI runs the e2e spec

ochafik added 2 commits April 24, 2026 17:15
…tal viewer scans

Server: display_pdf now opens the document via PDFDataRangeTransport
(disableAutoFetch) and only runs the per-page form/annotation walk when
getFieldObjects() is non-empty, so form-free PDFs are probed with ~10-25%
of bytes instead of a full download. The unused viewFieldInfo Map is
removed.

Viewer: getDocument sets disableAutoFetch/disableStream; baseline
annotation scan and field-name mapping run lazily per rendered page
instead of walking every page after load, so first paint no longer
schedules a full-file pull.

E2E: new range-counting HTTPS fixture (W-9 for forms, generated
text+image PDF for no-forms) with stallAfterBytes control, and four
regression tests asserting form fields are returned, <30% served on
no-forms display_pdf, first page renders while later ranges are stalled,
and overlap stays bounded.
…hema

pdfjs getFieldObjects() returns the full field-tree array. For PDFs with
a separated structure (pdf-lib, some authoring tools) the typed widget
sits at fields[1+] behind a typeless container at fields[0]; the previous
code only inspected fields[0] and skipped them all. Pick the first entry
with a non-empty type instead.

Makes the e2e forms.pdf fixture fully generated (no checked-in third-party
asset on the hot path); fw9.pdf stays as a unit-test fixture for the
hierarchical/XFA case.
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Apr 24, 2026

Open in StackBlitz

@modelcontextprotocol/ext-apps

npm i https://pkg.pr.new/@modelcontextprotocol/ext-apps@639

@modelcontextprotocol/server-basic-preact

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-preact@639

@modelcontextprotocol/server-basic-react

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-react@639

@modelcontextprotocol/server-basic-solid

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-solid@639

@modelcontextprotocol/server-basic-svelte

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-svelte@639

@modelcontextprotocol/server-basic-vanillajs

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vanillajs@639

@modelcontextprotocol/server-basic-vue

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vue@639

@modelcontextprotocol/server-budget-allocator

npm i https://pkg.pr.new/@modelcontextprotocol/server-budget-allocator@639

@modelcontextprotocol/server-cohort-heatmap

npm i https://pkg.pr.new/@modelcontextprotocol/server-cohort-heatmap@639

@modelcontextprotocol/server-customer-segmentation

npm i https://pkg.pr.new/@modelcontextprotocol/server-customer-segmentation@639

@modelcontextprotocol/server-debug

npm i https://pkg.pr.new/@modelcontextprotocol/server-debug@639

@modelcontextprotocol/server-map

npm i https://pkg.pr.new/@modelcontextprotocol/server-map@639

@modelcontextprotocol/server-pdf

npm i https://pkg.pr.new/@modelcontextprotocol/server-pdf@639

@modelcontextprotocol/server-scenario-modeler

npm i https://pkg.pr.new/@modelcontextprotocol/server-scenario-modeler@639

@modelcontextprotocol/server-shadertoy

npm i https://pkg.pr.new/@modelcontextprotocol/server-shadertoy@639

@modelcontextprotocol/server-sheet-music

npm i https://pkg.pr.new/@modelcontextprotocol/server-sheet-music@639

@modelcontextprotocol/server-system-monitor

npm i https://pkg.pr.new/@modelcontextprotocol/server-system-monitor@639

@modelcontextprotocol/server-threejs

npm i https://pkg.pr.new/@modelcontextprotocol/server-threejs@639

@modelcontextprotocol/server-transcript

npm i https://pkg.pr.new/@modelcontextprotocol/server-transcript@639

@modelcontextprotocol/server-video-resource

npm i https://pkg.pr.new/@modelcontextprotocol/server-video-resource@639

@modelcontextprotocol/server-wiki-explorer

npm i https://pkg.pr.new/@modelcontextprotocol/server-wiki-explorer@639

commit: b04b66a

ochafik added 4 commits April 24, 2026 19:43
…e loss

PdfCacheRangeTransport:
- abort() is a no-op stub on PDFDataRangeTransport (it's the hook pdfjs
  calls *on* the transport, not an upstream error channel). Expose a
  .failed promise that rejects on the first fetch error and race every
  pdfjs await against it in display_pdf, so transient network errors
  surface into the existing catch instead of hanging the tool call.
- pdfjs coalesces adjacent missing chunks into one unbounded
  requestDataRange; readPdfRange clamps each call to MAX_CHUNK_BYTES.
  Loop and deliver in slices so every requested chunk is marked loaded.

Viewer (mcp-app.ts):
- The lazy per-page baseline scan left pdfBaselineAnnotations incomplete,
  so persistAnnotations and getAnnotatedPdfBytes silently dropped
  restoredRemovedIds tombstones for unvisited pages. Union those ids into
  the computed diff and removedRefs.

Test fixture: release stalled handlers before resetStats/close so a
failing stalled test doesn't hang afterAll; fail fast if started with
NODE_ENV=production. NODE_TLS_REJECT_UNAUTHORIZED scope documented (full
per-process scoping needs a validateUrl localhost allow, tracked
separately).
…opback HTTP

PdfCacheRangeTransport.deliver(): pdf.js's reader is keyed by the
original begin and removed after one delivery, so accumulate slices and
call onDataRange once with the full buffer (the previous multi-call
approach threw inside pdfjs). Covered by a new integration test that
drives getDocument()/getPage(1) on a >1MB PDF through a clamping
in-memory readPdfRange.

validateUrl: allow http://127.0.0.1|localhost|[::1] only when
PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy can't be made
to probe its own ports. Covered by env-on/off unit tests.

Fixture switched to plain HTTP (no openssl, no
NODE_TLS_REJECT_UNAUTHORIZED). Adds /error.pdf (500s after 50KB) and two
e2e tests: page 2 renders after stall release (>512KB object path), and
display_pdf returns gracefully on mid-load 500. Existing <30% test now
samples stats before the iframe loads.

New unit tests: display_pdf returns (not hangs) on mid-load fetch
failure via in-memory MCP client; computeDiff/serializeDiff contract
tests pinning the restoredRemovedIds tombstone-preservation behaviour.
…hrough viewer

Adds /with-native-annot.pdf fixture (2 pages, native /Text annot on
page 2) and a Playwright test that drives delete on page 2 → iframe
reload → interact add_annotations on page 1 (persist with page 2
unscanned) → assert localStorage diff still contains the removed id →
page 2 still shows it gone. Covers the mcp-app.ts glue the unit tests
can't reach.
…test

server.ts: extract probeFormFields() so display_pdf calls one helper
instead of inlining 40 lines of transport/orFail plumbing; make
extractFormSchema's fieldObjects param required (the optional branch
only existed for a test helper).

server.test.ts: the >1MB integration test now forces the image XObject
fetch via getOperatorList() and asserts max requestDataRange span >
MAX_CHUNK_BYTES, so it can't go vacuous. Dedupe makeRandomJpeg by
importing from the fixture.

mcp-app.ts: restore per-annotation try/catch isolation in the lazy
baseline scan (a throw was skipping AnnotationLayer.render); only
carry forward restoredRemovedIds while the baseline scan is incomplete
so a stale id can't pin dirty=true once every page is scanned.

tests: bump fixture JPEG to 1.1MB; consolidate 7 e2e tests to 4 (merge
overlap into the byte-budget test, merge stall+page-2, drop the
error-e2e duplicate of the unit integration test); delete
run-fixture.mjs; drop /error.pdf and ERROR_AFTER_BYTES; rename
tombstone test's describe and clarify page-2 covers the viewer
transport.
@ochafik ochafik marked this pull request as ready for review April 27, 2026 13:26
@ochafik ochafik merged commit ba31be1 into main Apr 27, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant