perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639
Merged
perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639
Conversation
…tal viewer scans Server: display_pdf now opens the document via PDFDataRangeTransport (disableAutoFetch) and only runs the per-page form/annotation walk when getFieldObjects() is non-empty, so form-free PDFs are probed with ~10-25% of bytes instead of a full download. The unused viewFieldInfo Map is removed. Viewer: getDocument sets disableAutoFetch/disableStream; baseline annotation scan and field-name mapping run lazily per rendered page instead of walking every page after load, so first paint no longer schedules a full-file pull. E2E: new range-counting HTTPS fixture (W-9 for forms, generated text+image PDF for no-forms) with stallAfterBytes control, and four regression tests asserting form fields are returned, <30% served on no-forms display_pdf, first page renders while later ranges are stalled, and overlap stays bounded.
…hema pdfjs getFieldObjects() returns the full field-tree array. For PDFs with a separated structure (pdf-lib, some authoring tools) the typed widget sits at fields[1+] behind a typeless container at fields[0]; the previous code only inspected fields[0] and skipped them all. Pick the first entry with a non-empty type instead. Makes the e2e forms.pdf fixture fully generated (no checked-in third-party asset on the hot path); fw9.pdf stays as a unit-test fixture for the hierarchical/XFA case.
@modelcontextprotocol/ext-apps
@modelcontextprotocol/server-basic-preact
@modelcontextprotocol/server-basic-react
@modelcontextprotocol/server-basic-solid
@modelcontextprotocol/server-basic-svelte
@modelcontextprotocol/server-basic-vanillajs
@modelcontextprotocol/server-basic-vue
@modelcontextprotocol/server-budget-allocator
@modelcontextprotocol/server-cohort-heatmap
@modelcontextprotocol/server-customer-segmentation
@modelcontextprotocol/server-debug
@modelcontextprotocol/server-map
@modelcontextprotocol/server-pdf
@modelcontextprotocol/server-scenario-modeler
@modelcontextprotocol/server-shadertoy
@modelcontextprotocol/server-sheet-music
@modelcontextprotocol/server-system-monitor
@modelcontextprotocol/server-threejs
@modelcontextprotocol/server-transcript
@modelcontextprotocol/server-video-resource
@modelcontextprotocol/server-wiki-explorer
commit: |
…e loss PdfCacheRangeTransport: - abort() is a no-op stub on PDFDataRangeTransport (it's the hook pdfjs calls *on* the transport, not an upstream error channel). Expose a .failed promise that rejects on the first fetch error and race every pdfjs await against it in display_pdf, so transient network errors surface into the existing catch instead of hanging the tool call. - pdfjs coalesces adjacent missing chunks into one unbounded requestDataRange; readPdfRange clamps each call to MAX_CHUNK_BYTES. Loop and deliver in slices so every requested chunk is marked loaded. Viewer (mcp-app.ts): - The lazy per-page baseline scan left pdfBaselineAnnotations incomplete, so persistAnnotations and getAnnotatedPdfBytes silently dropped restoredRemovedIds tombstones for unvisited pages. Union those ids into the computed diff and removedRefs. Test fixture: release stalled handlers before resetStats/close so a failing stalled test doesn't hang afterAll; fail fast if started with NODE_ENV=production. NODE_TLS_REJECT_UNAUTHORIZED scope documented (full per-process scoping needs a validateUrl localhost allow, tracked separately).
…opback HTTP PdfCacheRangeTransport.deliver(): pdf.js's reader is keyed by the original begin and removed after one delivery, so accumulate slices and call onDataRange once with the full buffer (the previous multi-call approach threw inside pdfjs). Covered by a new integration test that drives getDocument()/getPage(1) on a >1MB PDF through a clamping in-memory readPdfRange. validateUrl: allow http://127.0.0.1|localhost|[::1] only when PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy can't be made to probe its own ports. Covered by env-on/off unit tests. Fixture switched to plain HTTP (no openssl, no NODE_TLS_REJECT_UNAUTHORIZED). Adds /error.pdf (500s after 50KB) and two e2e tests: page 2 renders after stall release (>512KB object path), and display_pdf returns gracefully on mid-load 500. Existing <30% test now samples stats before the iframe loads. New unit tests: display_pdf returns (not hangs) on mid-load fetch failure via in-memory MCP client; computeDiff/serializeDiff contract tests pinning the restoredRemovedIds tombstone-preservation behaviour.
…hrough viewer Adds /with-native-annot.pdf fixture (2 pages, native /Text annot on page 2) and a Playwright test that drives delete on page 2 → iframe reload → interact add_annotations on page 1 (persist with page 2 unscanned) → assert localStorage diff still contains the removed id → page 2 still shows it gone. Covers the mcp-app.ts glue the unit tests can't reach.
…test server.ts: extract probeFormFields() so display_pdf calls one helper instead of inlining 40 lines of transport/orFail plumbing; make extractFormSchema's fieldObjects param required (the optional branch only existed for a test helper). server.test.ts: the >1MB integration test now forces the image XObject fetch via getOperatorList() and asserts max requestDataRange span > MAX_CHUNK_BYTES, so it can't go vacuous. Dedupe makeRandomJpeg by importing from the fixture. mcp-app.ts: restore per-annotation try/catch isolation in the lazy baseline scan (a throw was skipping AnnotationLayer.render); only carry forward restoredRemovedIds while the baseline scan is incomplete so a stale id can't pin dirty=true once every page is scanned. tests: bump fixture JPEG to 1.1MB; consolidate 7 e2e tests to 4 (merge overlap into the byte-budget test, merge stall+page-2, drop the error-e2e duplicate of the unit integration test); delete run-fixture.mjs; drop /error.pdf and ERROR_AFTER_BYTES; rename tombstone test's describe and clarify page-2 covers the viewer transport.
…tFormFieldInfo throws
…me a cleared card, not removed from DOM
…me-reload replay support
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
display_pdfpreviously downloaded and parsed the entire PDF server-side to extract form metadata, defeating the viewer's chunked streaming. This PR makes both server and viewer truly incremental, and adds regression e2e tests with a range-counting fixture.Server (
server.ts)PdfCacheRangeTransportletsgetDocument()fetch only the byte ranges it needs. Exposes a.failedpromise (pdfjs has no upstream error channel) and accumulates >512KB coalesced requests into a singleonDataRangecall (pdfjs's reader is keyed by exactbegin).probeFormFields()opens the document via range transport, callsgetFieldObjects()first, and skips the per-page walk when empty → form-free PDFs are probed with ~10–25% of bytes instead of 100%. All errors (including transport.failed) resolve to empty results, sodisplay_pdfnever hangs on a mid-load network failure.extractFormSchemahandles separated field/widget trees (pdf-lib, some authoring tools) by picking the first array entry with a non-emptytype.validateUrlacceptshttp://127.0.0.1|localhost|[::1]only whenPDF_SERVER_ALLOW_LOOPBACK_HTTPis set (test fixtures, local dev). Off by default.viewFieldInfoMap removed.Viewer (
mcp-app.ts)disableAutoFetch/disableStreamongetDocumentso pdfjs doesn't background-prefetch.restoredRemovedIdstombstones are unioned into the persisted diff andremovedRefswhile the lazy baseline scan is incomplete, so a deletion on an unvisited page survives apersistAnnotationstriggered elsewhere.Regression e2e (
tests/e2e/pdf-incremental-load.spec.ts+tests/helpers/range-counting-server.ts)Plain-HTTP fixture (loopback only — playwright config sets
PDF_SERVER_ALLOW_LOOPBACK_HTTP=1). Serves three pdf-lib-generated PDFs (form-free 20pg with a 1.1MB image on pages 2+; 5-field form; 2-page with one native/Textannot), records bytes/overlap, supportsstallAfterBytes. 4 tests:display_pdfon a form PDF returns the 5 expectedformFieldsin the initial responsedisplay_pdffetches <30% of the file (server-side, before iframe loads); after first paint, byte-overlap < 50%Unit-test additions
PdfCacheRangeTransport: chunked accumulation,.failedrejection, empty-range guard, and a >1MB integration test that drivesgetOperatorList()and assertsmax(requestDataRange span) > MAX_CHUNK_BYTESso it can't go vacuousdisplay_pdfvia in-memory MCP client returns (not hangs) on mid-load fetch failureextractFormSchemafield-tree handling: pdf-lib, multi-widget, W-9/XFA, no-formvalidateUrlloopback gate: env-on/off, non-loopback http still rejectedcomputeDiff/serializeDiffcontract tests pinning the tombstone-union behaviourMeasurements
display_pdfform-free PDF (520KB)display_pdfW-9 (form, 140KB) p50 latency, statelessOut of scope (pre-existing, tracked separately)
startPreloading()(search-index builder) still walks all pages after first paintinitialDataserver.tsmodule split,Map<uuid, ViewSession>consolidation,waitFor*/submit_*/InteractCommandSchematriplicationTest plan
tsc --noEmitcleancomputeDiffcontract tests are intentionally documentation-level — the e2e tombstone test is the regression guard for thepersistAnnotationsglue.