[FEA]: pathfinder "all_must_work" alert mechanism #1943

@rwgk

Problem

The pathfinder all_must_work tests can fail in CI due to environment differences (e.g., MCDM driver mode, missing DLLs, package version mismatches) that are difficult to reproduce locally. Currently, these failures block the entire CI workflow.

We want to surface pathfinder issues without blocking PRs, while ensuring they don't go unnoticed.

Motivation

Adding a dedicated .github/workflows/pathfinder.yml with its own matrix would:

  • Require significant additional CI code
  • Add long-term maintenance burden
  • Duplicate matrix definitions across workflows

We need a lighter-weight alternative that still provides visibility into pathfinder health.

Proposed Solution

Use a "sentinel issue" pattern: a single GitHub issue that tracks pathfinder all_must_work failures over time.

Behavior

  1. Create a tracking issue: "Pathfinder all_must_work alert"
  2. When all_must_work fails in CI:
    • Post a comment to the issue with job details (run link, commit, PR, etc.)
    • Reopen the issue if it's currently closed
  3. The test step uses continue-on-error: true so it doesn't block the workflow

Benefits

  • Single source of truth - all failures tracked in one place with history
  • Automatic notifications - only maintainers subscribed to the issue are notified
  • Low maintenance - no new workflow files or matrices to maintain
  • Self-healing - closing the issue signals "all clear"; it auto-reopens on failure
  • Actionable - issue can be assigned, labeled, added to projects

Implementation Sketch

- name: Pathfinder all_must_work
  continue-on-error: true
  id: pathfinder
  run: run-tests pathfinder-strict

- name: Report pathfinder failure to tracking issue
  if: always() && steps.pathfinder.outcome == 'failure'
  env:
    GH_TOKEN: ${{ github.token }}
    GH_REPO: ${{ github.repository }}
    # Pass event data through the environment instead of expanding it inside the script
    PR_URL: ${{ github.event.pull_request.html_url || 'N/A' }}
  run: |
    ISSUE_TITLE="Pathfinder all_must_work alert"

    # Find existing issue by title (searches both open and closed);
    # "// empty" keeps the output empty when nothing matches, so the -z test is reliable
    ISSUE_NUMBER=$(gh issue list --search "in:title \"$ISSUE_TITLE\"" --state all --json number --jq '.[0].number // empty')

    if [ -z "$ISSUE_NUMBER" ]; then
      # Create the tracking issue if it doesn't exist.
      # gh issue create has no --json flag; it prints the new issue's URL,
      # so the number is taken from the last path segment.
      # The labels must already exist in the repository.
      ISSUE_URL=$(gh issue create \
        --title "$ISSUE_TITLE" \
        --body "Tracking issue for pathfinder all_must_work CI failures. This issue is automatically reopened when failures occur." \
        --label "alert,pathfinder")
      ISSUE_NUMBER="${ISSUE_URL##*/}"
    fi

    # Post comment with failure details, using the runner's default environment
    # variables so that values such as branch names are never pasted into the script.
    # The table rows stay indented so they remain inside the YAML block scalar.
    gh issue comment "$ISSUE_NUMBER" --body "## ⚠️ Failure detected

    | Field | Value |
    |-------|-------|
    | **Run** | $GITHUB_SERVER_URL/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID |
    | **Job** | $GITHUB_JOB |
    | **Commit** | \`$GITHUB_SHA\` |
    | **Branch** | \`$GITHUB_REF_NAME\` |
    | **PR** | $PR_URL |
    | **Actor** | @$GITHUB_ACTOR |
    "

    # Reopen if currently closed
    gh issue reopen "$ISSUE_NUMBER" 2>/dev/null || true
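
A multi-line --body string is easy to break once it is indented inside a YAML run: block. One possible alternative, shown here only as a sketch, is a small shell helper that prints the comment body and pipes it to gh issue comment through --body-file -. The function name build_failure_comment and the PR_URL variable are hypothetical (PR_URL would have to be set in the step's env: block); the GITHUB_* variables are the defaults that GitHub Actions runners export.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the proposal: print the Markdown body
# of the failure comment from environment variables.
build_failure_comment() {
  # Link to the workflow run, built from default GitHub Actions variables.
  local run_url="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"
  # One printf argument per output line; values are expanded once and
  # never re-evaluated by the shell.
  printf '%s\n' \
    '## ⚠️ Failure detected' \
    '' \
    '| Field | Value |' \
    '|-------|-------|' \
    "| **Run** | ${run_url} |" \
    "| **Job** | ${GITHUB_JOB} |" \
    "| **Commit** | \`${GITHUB_SHA}\` |" \
    "| **Branch** | \`${GITHUB_REF_NAME}\` |" \
    "| **PR** | ${PR_URL:-N/A} |" \
    "| **Actor** | @${GITHUB_ACTOR} |"
}
```

In the reporting step this could be used as build_failure_comment | gh issue comment "$ISSUE_NUMBER" --body-file - so that the script no longer contains a quoted multi-line string.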

Optional: Recovery Notification

- name: Report pathfinder recovery
  if: always() && steps.pathfinder.outcome == 'success'
  env:
    GH_TOKEN: ${{ github.token }}
    GH_REPO: ${{ github.repository }}
  run: |
    # Only comment if issue is open (indicating recent failure);
    # "// empty" keeps the output empty when no open issue matches
    ISSUE_NUMBER=$(gh issue list --search "in:title \"Pathfinder all_must_work alert\"" --state open --json number --jq '.[0].number // empty')
    if [ -n "$ISSUE_NUMBER" ]; then
      gh issue comment "$ISSUE_NUMBER" --body "✅ Passing again as of ${{ github.sha }}"
    fi
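
While the issue stays open, every passing job in the matrix would post its own recovery comment. One way to limit that, sketched here as an assumption rather than as part of the proposal, is to post only when the most recent comment on the issue is a failure report. The helper name should_post_recovery is hypothetical, and the check relies on the "Failure detected" heading used in the failure comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether a recovery comment is worth posting,
# given the body of the most recent comment on the tracking issue.
should_post_recovery() {
  case "$1" in
    *"Failure detected"*) return 0 ;;  # last event was a failure: announce recovery
    *) return 1 ;;                      # recovery already posted, or no comments yet
  esac
}
```

The last comment body could be fetched with gh issue view "$ISSUE_NUMBER" --json comments --jq '.comments[-1].body // empty' and passed to the helper before calling gh issue comment. Concurrent jobs can still race, so this reduces duplicates rather than eliminating them.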

Alternatives Considered

| Approach | Pros | Cons |
|----------|------|------|
| Dedicated workflow + matrix | Full control, separate status | High maintenance, duplicated config |
| ::warning:: annotations only | Simple | Easy to miss, no history |
| PR labels | Filterable | No notifications, manual cleanup |
| PR comments | Visible per-PR | No aggregate view, noisy |
| Sentinel issue (proposed) | Central tracking, notifications, history | Comments accumulate |

Requirements

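  • issues: write permission in the workflow (for pull requests from forks the default GITHUB_TOKEN is read-only, so the reporting step cannot post from those runs)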

Labels

CI/CD (CI/CD infrastructure), P1 (Medium priority - Should do), cuda.pathfinder (Everything related to the cuda.pathfinder module)