Skip to content

Add PDF keycard parsing functionality#8946

Merged
mohammadalfaiyazbitgo merged 5 commits into
masterfrom
claude/compassionate-euler-DJa99
Jun 5, 2026
Merged

Add PDF keycard parsing functionality#8946
mohammadalfaiyazbitgo merged 5 commits into
masterfrom
claude/compassionate-euler-DJa99

Conversation

@mohammadalfaiyazbitgo
Copy link
Copy Markdown
Contributor

@mohammadalfaiyazbitgo mohammadalfaiyazbitgo commented Jun 4, 2026

Description

This PR adds functionality to extract and parse BitGo keycard data from PDF files. It includes:

  1. New parsing module (parseKeycard.ts): Core logic to parse keycard sections (A–D) from text lines extracted from PDFs, with robust handling of:

    • Multi-line section values (base64, xpub strings)
    • Page-break labels ("Part N") that appear standalone or embedded in content
    • JSON-formatted encrypted wallet passwords with proper brace counting to detect section boundaries
    • PDF text node reconstruction into logical lines based on spatial coordinates
  2. PDF extraction module (extractKeycardFromPDF.ts): Wrapper around pdfjs-dist to extract text nodes from PDF files and convert them to keycard entries

  3. Web demo integration: Added file upload UI in the KeyCard component to allow users to upload a keycard PDF and view extracted sections in a table

  4. Dependencies: Added pdfjs-dist (^4.0.0) to both @bitgo/key-card and web-demo packages

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Added comprehensive unit tests in parseKeycardFromLines.test.ts covering:

  • Happy path with clean JSON in a single line
  • Standalone "Part N" page-break labels
  • Multiple embedded "Part N" labels within base64 content
  • "Part N" labels fused mid-line into content

All tests verify correct reconstruction of encrypted wallet password JSON and other section values.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My code compiles correctly for both Node and Browser environments
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my feature works
  • New and existing unit tests pass locally with my changes

https://claude.ai/code/session_01Pj4SsrmSnoBNj8h6zX1vaz

@mohammadalfaiyazbitgo mohammadalfaiyazbitgo requested review from a team as code owners June 4, 2026 14:57
@mohammadalfaiyazbitgo mohammadalfaiyazbitgo marked this pull request as draft June 4, 2026 14:57
…d sdk

- add parseKeycard.ts with pure string parsing logic (parseKeycardFromLines,
  buildLinesFromPDFNodes, KeycardEntry, PDFTextNode types)
- add extractKeycardFromPDF.ts with pdfjs-dist based PDF extraction
  (browser-only; consumer must configure GlobalWorkerOptions.workerSrc)
- add pdfjs-dist ^5.0.0 dependency to @bitgo/key-card
- export new functions from module index
- add mocha unit tests for parseKeycardFromLines covering all Part N edge cases
- add pdf parse demo UI to web-demo KeyCard page with file upload,
  result table, and worker configuration for webpack

WCN-19
pdfjs-dist@5.7.x requires node >=22.13.0 || >=24, which is incompatible
with the node 20.x CI runner. downgrade to ^4.0.0 which supports node 18+.
update web-demo worker path from .mjs to .js to match v4 build output.
fix prettier formatting in web-demo KeyCard component.

WCN-19
pdfjs-dist initializes browser-only globals (DOMMatrix) at module load
time. using a static top-level import caused all tests in @bitgo/key-card
to crash in node.js environments. switching to a dynamic import() inside
extractKeycardEntriesFromPDF defers loading to call time (browser only).

WCN-19
…nsions

pdfjs-dist v4 only ships .mjs worker files (no .js variants). the previous
commit incorrectly changed the worker path to .mjs -> .js. reverted to
pdf.worker.min.mjs and added .mjs to webpack resolve extensions in both
dev and prod configs so webpack can locate the file.

WCN-19
@mohammadalfaiyazbitgo mohammadalfaiyazbitgo force-pushed the claude/compassionate-euler-DJa99 branch from 2d5100a to 45ec5bf Compare June 4, 2026 17:32
@mohammadalfaiyazbitgo mohammadalfaiyazbitgo marked this pull request as ready for review June 4, 2026 17:45
@Marzooqa
Copy link
Copy Markdown
Contributor

Marzooqa commented Jun 5, 2026

@claude

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

Claude finished @Marzooqa's task —— View job


I'll help you with this PR! Since you didn't specify a particular request, I'll conduct a comprehensive code review of the PDF keycard parsing functionality.

Todo List

  • Read repository guidelines and setup
  • Examine the PR changes and implementation
  • Review code quality and security
  • Check TypeScript adherence and project conventions
  • Provide comprehensive feedback

Code Review Summary

I've reviewed the PDF keycard parsing functionality implementation. Overall, this is a well-structured and thoughtfully implemented feature. Here's my detailed analysis:

Strengths

1. Solid Architecture & Separation of Concerns

  • Clean separation between PDF extraction (extractKeycardFromPDF.ts) and text parsing (parseKeycard.ts)
  • Well-defined TypeScript interfaces (PDFTextNode, KeycardEntry)
  • Proper async/await usage and error handling

2. Robust PDF Text Reconstruction

  • buildLinesFromPDFNodes() function cleverly handles PDF text node spatial arrangement: modules/key-card/src/parseKeycard.ts:46-106
  • Smart coordinate-based line reconstruction with tolerance for slight Y-axis variations (±2 pixels)
  • Handles page breaks and text spacing correctly

3. Excellent Edge Case Handling

  • Comprehensive handling of "Part N" page-break labels in normalizeSectionValue(): modules/key-card/src/parseKeycard.ts:22-36
  • Both standalone labels and embedded labels are properly stripped
  • JSON brace counting for encrypted wallet password boundary detection: modules/key-card/src/parseKeycard.ts:166-171

4. Comprehensive Test Coverage

  • Tests cover all major scenarios including multi-line content and embedded labels: modules/key-card/test/unit/parseKeycard.test.ts
  • Good test naming and clear test data

5. Browser Safety

  • Dynamic imports prevent Node.js issues with browser-only globals: modules/key-card/src/extractKeycardFromPDF.ts:17
  • Proper webpack worker configuration: modules/web-demo/src/components/KeyCard/index.tsx:14-17

⚠️ Areas for Improvement

1. Error Handling Enhancement

// modules/key-card/src/extractKeycardFromPDF.ts:27-30
if (!('str' in item) || !Array.isArray(item.transform)) {
  continue;
}

Consider adding debug logging for skipped items to help with troubleshooting PDF parsing issues.

2. Magic Number Documentation

// modules/key-card/src/parseKeycard.ts:52, 77, 92
const yDiff = Math.abs(a.y - b.y);
if (yDiff > 2) { // Magic number - consider making this configurable

The ±2 pixel tolerance is used in multiple places - consider extracting to a named constant.

3. Type Safety

// modules/key-card/src/parseKeycard.ts:121-122
const section = labelMatch[1]?.toUpperCase();
const ?? '');

Good use of optional chaining, but could benefit from more explicit type guards.

4. Performance Consideration
The regex operations in normalizeSectionValue() run on potentially long strings. For very large keycard files, consider streaming approaches.

🔒 Security Assessment

  • ✅ No direct file system access
  • ✅ Input validation through regex patterns
  • ✅ No eval() or dynamic code execution
  • ✅ Proper sanitization of PDF text content
  • ✅ No credential exposure risks

📋 Code Style & Standards

  • ✅ Follows BitGoJS TypeScript conventions
  • ✅ Good use of specific types over any
  • ✅ Proper JSDoc comments where needed
  • ✅ Consistent naming conventions
  • ⚠️ Could use more inline comments explaining complex spatial logic

🧪 Testing & Dependencies

  • ✅ Comprehensive unit tests with good coverage
  • ✅ Appropriate dependency version (pdfjs-dist ^4.0.0 for Node 20 compatibility)
  • ✅ Proper dev vs production dependency classification
  • ✅ Tests follow existing project patterns

🌐 Browser Integration

  • ✅ React component properly handles file upload UX
  • ✅ Good error states and loading indicators
  • ✅ Proper file input restrictions (PDF only)
  • ✅ Clean table presentation of parsed data

📋 Minor Suggestions

  1. Consider extracting the coordinate tolerance constant:
const PDF_COORDINATE_TOLERANCE = 2;
  1. Add JSDoc comments to key functions like buildLinesFromPDFNodes() for better documentation.

  2. Consider adding a validation function to verify keycard structure completeness.

Verdict

This is a high-quality implementation that demonstrates good software engineering practices. The code is well-structured, thoroughly tested, and handles edge cases appropriately. The approach to PDF parsing is pragmatic and robust.

Ready for merge - The implementation meets BitGoJS standards and the review approval indicates it's ready for production use.


Job run: https://github.com/BitGo/BitGoJS/actions/runs/checkout
Branch: claude/compassionate-euler-DJa99

…d jsdoc

- extract magic number 2 to PDF_LINE_Y_TOLERANCE and PDF_NODE_GAP_THRESHOLD
  constants with explanatory comments
- add jsdoc to buildLinesFromPDFNodes explaining the spatial reconstruction
  algorithm

WCN-19

Signed-off-by: Mohammad Al Faiyaz <mohammadalfaiyaz@bitgo.com>
@mohammadalfaiyazbitgo mohammadalfaiyazbitgo force-pushed the claude/compassionate-euler-DJa99 branch from 5f23f81 to a9ec49c Compare June 5, 2026 14:52
Comment thread modules/key-card/test/unit/parseKeycard.test.ts
@mohammadalfaiyazbitgo mohammadalfaiyazbitgo merged commit edc3d2e into master Jun 5, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants