feat(pipeline): add PDF as a first-class InDesign input #101

Merged
PAMulligan merged 1 commit into main from 64-indesign-pipeline-pdf-ingestion-as-primary-input-designer-to-developer-pdf-handoff on Jun 20, 2026

@PAMulligan

Collaborator

Summary

Makes PDF a first-class input to the InDesign-to-React pipeline (#64, part of #62). InDesign-exported PDFs are what designers actually hand to engineering, so the parser ingests them into the same IR the IDML parser emits. This keeps the token mapper (#65) and the upcoming component generator (#66) source-agnostic.

New packages/pipeline/src/pdf/ subsystem (via pdfjs-dist) plus a source-agnostic parseSourceFile entry and CLI integration.

What it does

  • Text — extracts positioned glyph runs and clusters them: runs → lines (split on large gaps so columns don't merge) → columns → text frames.
  • Styles — infers a heading/body/caption scale from font-size buckets, synthesized as ParagraphStyles the mapper clusters exactly as IDML styles.
  • Color — derives a swatch palette from operator-list fills/strokes and dominant image colors (CMYK/RGB/Gray → sRGB), de-duplicated within tolerance.
  • Images — extracts embedded image XObjects, decodes to PNG (pngjs), and writes them to an --assets dir, addressable from the IR.
  • Geometry — flips PDF's bottom-left origin to the IR's top-left pixel space at a configurable DPI.
  • Fidelity — emits TEXT_FROM_GLYPHS, NO_EMBEDDED_FONTS, VECTOR_ONLY_PAGE, MULTI_COLUMN_DETECTED, IMAGE_NOT_EXTRACTED warnings, surfaced by the CLI and documented in docs/pipeline/indesign-pdf-fidelity.md.
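The geometry and text steps above can be sketched roughly as follows. This is a minimal illustration, not the code in packages/pipeline/src/pdf/: the `GlyphRun` and `PxRun` shapes, the function names, and the `gapFactor` and `baselineTolPx` defaults are all assumptions made for the example.

```typescript
// Hypothetical run shapes; the PR's real IR types are not shown in this description.
interface GlyphRun {
  text: string;
  x: number; // left edge in PDF points, origin at bottom-left
  y: number; // baseline in PDF points, origin at bottom-left
  width: number; // advance width in points
  fontSize: number; // points
}

interface PxRun {
  text: string;
  x: number; // left edge in pixels, origin at top-left
  y: number; // baseline in pixels, origin at top-left
  width: number;
  fontSize: number;
}

const POINTS_PER_INCH = 72;

/** Flip a bottom-left-origin run into top-left pixel space at the given DPI. */
function toTopLeftPx(run: GlyphRun, pageHeightPt: number, dpi: number): PxRun {
  const scale = dpi / POINTS_PER_INCH;
  return {
    text: run.text,
    x: run.x * scale,
    y: (pageHeightPt - run.y) * scale,
    width: run.width * scale,
    fontSize: run.fontSize * scale,
  };
}

/**
 * Group runs that share a baseline into rows, then split each row wherever the
 * horizontal gap exceeds gapFactor * fontSize, so side-by-side columns do not merge.
 */
function clusterLines(runs: PxRun[], gapFactor = 2, baselineTolPx = 2): string[] {
  const byBaseline = [...runs].sort((a, b) => a.y - b.y);
  const rows: PxRun[][] = [];
  for (const run of byBaseline) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(row[0].y - run.y) <= baselineTolPx) row.push(run);
    else rows.push([run]);
  }

  const lines: string[] = [];
  for (const row of rows) {
    row.sort((a, b) => a.x - b.x);
    let words: string[] = [];
    let prev: PxRun | undefined;
    for (const run of row) {
      // A large gap ends the current line and starts a new one.
      if (prev && run.x - (prev.x + prev.width) > gapFactor * prev.fontSize) {
        lines.push(words.join(" "));
        words = [];
      }
      words.push(run.text);
      prev = run;
    }
    lines.push(words.join(" "));
  }
  return lines;
}
```

Under these assumptions, two runs sitting close together on one baseline join into a single line, while a run far to the right on the same baseline becomes its own line, which a later step could assign to a second column.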

CLI & API

# Auto-detects .pdf vs .idml; --source-priority forces a path.
aurelius-indesign brochure.pdf --emit-tokens ./src/tokens --assets ./public/assets

import { parsePdfFile } from "@aurelius/pipeline/pdf";
import { parseSourceFile } from "@aurelius/pipeline";

const { document } = await parseSourceFile("brochure.pdf", {
  sourcePriority: "pdf",
  assetDir: "public/assets",
});

runCli is now async; meta.source ("idml" | "pdf") was added to the IR.
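To illustrate the auto-detection and the new meta.source field, here is a small sketch of how a source-agnostic entry point might pick a parser. It is not the actual parseSourceFile implementation: `detectSource`, `DetectOptions`, and `IrMeta` are hypothetical names, and treating `sourcePriority` as an unconditional override is an assumption about what "forces a path" means.

```typescript
// The two source kinds named in the PR for meta.source.
type SourceKind = "idml" | "pdf";

// Hypothetical options shape; only sourcePriority is modeled here.
interface DetectOptions {
  sourcePriority?: SourceKind;
}

// Hypothetical slice of the IR metadata carrying the new field.
interface IrMeta {
  source: SourceKind;
}

/** Pick the parser path: an explicit priority wins, otherwise go by file extension. */
function detectSource(path: string, options: DetectOptions = {}): SourceKind {
  if (options.sourcePriority) return options.sourcePriority; // assumption: forced path
  const lower = path.toLowerCase();
  if (lower.endsWith(".pdf")) return "pdf";
  if (lower.endsWith(".idml")) return "idml";
  throw new Error(`Unsupported input (expected .pdf or .idml): ${path}`);
}

/** Record which parser produced the IR so downstream tools can report it. */
function metaFor(path: string, options: DetectOptions = {}): IrMeta {
  return { source: detectSource(path, options) };
}
```

Downstream code such as the token mapper can then stay source-agnostic and consult `meta.source` only where it matters, for example when deciding whether to print PDF fidelity warnings.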

Acceptance criteria (#64)

  • Fixture PDF → valid IR consumable by the mapper without a companion IDML (PDF → IR → tokens tested)
  • CLI accepts .pdf as a primary input and runs PDF → IR → tokens end-to-end
  • Parity test: same-shape IDML & PDF agree on page count and heading/body buckets within tolerance
  • Embedded images extracted and addressable from the IR
  • Swatch palette derived from the PDF when no IDML is supplied
  • Fidelity warnings in CLI output + documented in docs/pipeline/indesign-pdf-fidelity.md
  • README + pipeline docs describe PDF as first-class with a "designer hands you a PDF" quickstart
  • Unit tests cover text-heavy, image-heavy, multi-column, single-page brochure, and vector-only/outlined PDFs
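The heading/body buckets that the parity criterion compares could be inferred along these lines. This is a sketch under stated assumptions rather than the PR's algorithm: the body size is taken to be the font size covering the most characters, and the 1.2x and 0.85x thresholds, like the names `inferBodySize` and `bucketFor`, are invented for illustration.

```typescript
type Bucket = "heading" | "body" | "caption";

// Hypothetical minimal run shape: just the text and its font size in points.
interface SizedRun {
  text: string;
  fontSize: number;
}

/** Assume the body size is the half-point-rounded size that covers the most characters. */
function inferBodySize(runs: SizedRun[]): number {
  const charsBySize = new Map<number, number>();
  for (const run of runs) {
    const size = Math.round(run.fontSize * 2) / 2; // round to half points to absorb export jitter
    charsBySize.set(size, (charsBySize.get(size) ?? 0) + run.text.length);
  }
  let bestSize = 0; // stays 0 when there are no runs
  let bestChars = -1;
  charsBySize.forEach((chars, size) => {
    if (chars > bestChars) {
      bestSize = size;
      bestChars = chars;
    }
  });
  return bestSize;
}

/** Classify a font size relative to the body size; thresholds are illustrative only. */
function bucketFor(fontSize: number, bodySize: number): Bucket {
  if (fontSize >= bodySize * 1.2) return "heading";
  if (fontSize <= bodySize * 0.85) return "caption";
  return "body";
}
```

A parity check in this spirit would run the same bucketing over the IDML-derived and PDF-derived IR and compare the resulting sizes within a tolerance, since PDF export can shift sizes by fractions of a point.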

Testing

  • pnpm --filter @aurelius/pipeline typecheck / test / build ✅ — 117 tests (24 new)
  • Repo-wide eslint . (0 errors) and prettier --check . ✅; check-doc-counts ✅; lockfile in sync (pnpm 9 compatible)
  • Verified the compiled CLI on a real PDF: PDF IR report, PDF → tokens, and image extraction (image-1.png) all work; config/font-map.json resolves from dist/

Notes

  • Not pixel-perfect by design — the goal is a usable, styled IR for the generator with manual touch-ups. IDML remains preferred when available (richer style metadata); force PDF with --source-priority pdf to verify parity.
  • pdfjs-dist runs in Node with no worker/canvas (data-extraction only). New deps: pdfjs-dist and pngjs (runtime); pdf-lib and @types/pngjs (dev only, with pdf-lib used to generate test fixtures).

Closes #64
Part of #62

🤖 Generated with Claude Code

Parse InDesign-exported PDFs into the same IR as the IDML parser, so the token
mapper and component generator are source-agnostic (#64, part of #62).

- Extract text runs, fill/stroke colors, and embedded images via pdfjs-dist
- Cluster runs into text frames; infer heading/body/caption buckets from font
  sizes; detect columns; derive a swatch palette from fills and image colors
- Extract embedded images to PNG (pngjs), addressable from the IR
- Emit fidelity warnings (no embedded fonts, text-from-glyphs, vector-only page,
  multi-column, image-not-extracted) surfaced by the CLI
- CLI accepts .pdf as a primary input with --source-priority and --assets;
  runCli is now async; parseSourceFile auto-detects .idml vs .pdf
- Add meta.source ("idml" | "pdf") to the IR
- 24 new tests (117 total) across the five PDF shapes plus PDF->IR->tokens and
  IDML/PDF parity; pdf-lib generates hermetic fixtures
- Document fidelity in docs/pipeline/indesign-pdf-fidelity.md

Refs #64, #62

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PAMulligan self-assigned this on Jun 20, 2026
PAMulligan added labels on Jun 20, 2026: enhancement (New feature or request), pipeline (Figma/Canva-to-React conversion pipeline), performance (Performance improvements), react (React-specific functionality)
PAMulligan moved this from Todo to Done in PMDS Open Source Roadmap on Jun 20, 2026
PAMulligan merged commit 57a35c4 into main on Jun 20, 2026
7 checks passed

Linked issue closed by this pull request:

[InDesign pipeline] PDF ingestion as primary input (designer-to-developer PDF handoff)
