
[release/9.0] Surface scheduled outerloop Helix work item failures (backport of #129049, #129629) #129908

Open
mmitche wants to merge 2 commits into dotnet:release/9.0 from mmitche:backport/outerloop-helix-warnings-release-9.0



mmitche commented Jun 26, 2026

Member

Backport of #129049 and #129629 to release/9.0.

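Combines both changes:

- #129049: lets scheduled outerloop runs finish as `succeeded` when only Helix work items or tests fail, so AzDO stops re-queuing the same commit.
- #129629: surfaces the failed Helix work items as warnings so they stay visible in the timeline.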

Conflicts: the `helix.yml` parameters list conflicted with the SuperPmi parameters already present on this branch; resolved by keeping both.

Note

This pull request was authored with the assistance of GitHub Copilot.

mmitche and others added 2 commits June 26, 2026 13:13
…otnet#129049)


Several scheduled outerloop pipelines (the `outerloop.yml` family:
`runtime-libraries-coreclr outerloop` and its `-windows`/`-linux`/`-osx`
variants) use an `always: false` scheduled trigger. With `always:
false`, AzDO only starts a new scheduled run if the source changed
**since the last _successful_ scheduled run**.
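For reference, a scheduled trigger of this kind looks roughly like the following in Azure Pipelines YAML. This is a minimal sketch: the cron expression, display name, and branch filter are placeholders, not the values used by `outerloop.yml`.

```yaml
# Sketch of a scheduled trigger; cron, displayName, and branch are illustrative.
schedules:
- cron: "0 11 * * *"            # once a day, in UTC
  displayName: Daily outerloop run
  branches:
    include:
    - release/9.0
  always: false                 # run only if the source changed since the
                                # last successful scheduled run
```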

Because the repo has many flaky outerloop tests, the Helix test work
items virtually always have at least one failure, which fails the "Send
to Helix" step and therefore the whole build. The build never reaches a
`succeeded` state, so AzDO re-queues **the same, unchanged commit** day
after day, submitting more and more Helix work for no benefit.
(Empirically confirmed: a single commit was re-run and failed for 19
consecutive days; once a sibling definition produced a genuinely
successful run, the same-SHA re-queue stopped.)

`continueOnError: true` only downgrades the build to
`partiallySucceeded`, which AzDO's `always: false` scheduler still does
**not** treat as successful — so the same commit keeps getting
re-queued. The Helix step must end **fully successful** (exit 0).

Make the "Send to Helix" step actually succeed on scheduled runs by
disabling the two Arcade `Microsoft.DotNet.Helix.Sdk` properties that
fail the build (both default to `true`):

- **`FailOnWorkItemFailure`** — `CheckHelixJobStatus` errors when a work
item exits non-zero.
- **`FailOnTestFailure`** — `CheckAzurePipelinesTestResults` errors when
any published test failed.

Setting both to `false` lets the msbuild step exit 0, producing a fully
`succeeded` build. Failed tests are still published and visible in the
test results tab; AzDO does not auto-degrade a build to
`partiallySucceeded` just because a published test run contains failures
— only a failing task would.

- **`eng/pipelines/libraries/helix.yml`**: Added a `failOnTestFailures`
parameter (default `true`, preserving today's behavior) wired to
`/p:FailOnWorkItemFailure` and `/p:FailOnTestFailure` on the Send to
Helix msbuild invocation.
- **`eng/pipelines/libraries/outerloop.yml`**: Passes
`failOnTestFailures: false` **only on scheduled runs** (`Build.Reason ==
'Schedule'`) for all three matrix legs (Release, Debug, NET48).
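The wiring described in the two bullets above might look roughly like the following. This is a sketch under stated assumptions, not the actual diff: `$(helixMsbuildCommand)` is a hypothetical placeholder for the real MSBuild invocation, all other parameters and arguments are omitted, and the way `outerloop.yml` references `helix.yml` is simplified to a plain template call.

```yaml
# eng/pipelines/libraries/helix.yml (sketch; unrelated parameters omitted)
parameters:
  failOnTestFailures: true      # default preserves the existing behavior

steps:
- script: >-
    $(helixMsbuildCommand)
    /p:FailOnWorkItemFailure=${{ parameters.failOnTestFailures }}
    /p:FailOnTestFailure=${{ parameters.failOnTestFailures }}
  displayName: Send to Helix
```

```yaml
# eng/pipelines/libraries/outerloop.yml (sketch of one matrix leg)
steps:
- template: /eng/pipelines/libraries/helix.yml
  parameters:
    # Inserted at template expansion time, only for scheduled runs.
    ${{ if eq(variables['Build.Reason'], 'Schedule') }}:
      failOnTestFailures: false
```

Because the conditional block is only inserted when `Build.Reason` is `Schedule`, every other run falls back to the parameter default of `true`.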

The new parameter defaults to `true`, so all other `helix.yml` callers
are unaffected (none set `WaitForWorkItemCompletion` or these properties
on this path, so they already resolve to `true`). Only scheduled
outerloop runs change behavior. PR / rolling / manual outerloop runs
continue to fail on Helix failures exactly as before. Build/compile
breaks still fail scheduled runs (this only affects the Helix step).

On scheduled runs, `FailOnWorkItemFailure=false` also masks work-item
crashes/timeouts/infra failures, not just test-assertion failures. This
is an accepted tradeoff for the goal of stopping the wasteful daily
re-queue of unchanged commits; results remain visible in the Helix/test
reporting.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…net#129629)

## Problem

PR dotnet#129049 made scheduled outerloop builds succeed when only Helix tests
fail, by setting `FailOnWorkItemFailure`/`FailOnTestFailure` to `false`
on scheduled runs (via the `failOnTestFailures: false` parameter). This
stopped AzDO's `always: false` scheduler from re-queueing the same
commit day after day.

The side effect: failed Helix work items became **completely invisible**
in the Azure DevOps timeline. The `Send to Helix` step is fully green,
so there is no signal that work items failed (even though, for flaky
outerloop, they almost always do).

## Fix

Surface failed work items as **warnings** instead of silently dropping
them. Warnings keep the failures visible in the timeline but do **not**
degrade the build below `succeeded` (so the `always: false` re-queue fix
from dotnet#129049 is preserved).

- **`src/libraries/sendtohelixhelp.proj`**: new
`WarnOnHelixWorkItemFailure` target (`AfterTargets=CheckHelixJobStatus`)
that emits a `<Warning>` for each failed `@(CompletedWorkItem)` when
`WarnOnHelixTestFailure=true`. This mirrors what the Arcade SDK's
`CheckHelixJobStatus` would have *errored* on, but as a warning.
- **`eng/pipelines/libraries/helix.yml`**: new `warnOnTestFailures`
parameter (default `false`) wired to `/p:WarnOnHelixTestFailure`.
- **`eng/pipelines/libraries/outerloop.yml`**: scheduled runs now set
`warnOnTestFailures: true` alongside `failOnTestFailures: false` on all
three legs.
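On a scheduled run, the two parameters would then be passed together. As a sketch (the template call shape is simplified and is an assumption, not the actual `outerloop.yml` content):

```yaml
# eng/pipelines/libraries/outerloop.yml (sketch of one leg; call shape simplified)
steps:
- template: /eng/pipelines/libraries/helix.yml
  parameters:
    ${{ if eq(variables['Build.Reason'], 'Schedule') }}:
      failOnTestFailures: false   # Helix failures no longer fail the step
      warnOnTestFailures: true    # each failed work item is reported as a warning
```

Inside `helix.yml`, `warnOnTestFailures` is forwarded as `/p:WarnOnHelixTestFailure`, which gates the new `WarnOnHelixWorkItemFailure` target.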

No warn-as-error change was needed: the `Send to Helix` step already
runs with warnaserror disabled (`_warnAsErrorParamHelixOverride`), so
these warnings are not promoted back into build-failing errors.

## Validation

Ran the `runtime-libraries-coreclr outerloop` pipeline (dnceng-public
def 125, [build
1472840](https://dev.azure.com/dnceng-public/public/_build/results?buildId=1472840))
with a temporary Manual gate. Multiple CoreCLR_Release legs completed
**succeeded** with failed work items surfaced as warnings and **zero
errors**, e.g.:

```
src/libraries/sendtohelixhelp.proj(364,5): warning : Work item System.Runtime.Numerics.Tests in job 2e01f1b1-... has failed. Failure log: https://helix.dot.net/api/.../console
```

Legs whose work items all passed produced no such warning, as expected.


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet-policy-service

Contributor

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

Copilot AI left a comment

Contributor


Pull request overview

Backports changes to the release/9.0 outerloop Helix pipeline so that scheduled outerloop runs (`always: false`) no longer fail the build on Helix work item or test failures, which prevents Azure DevOps from re-queuing the same commit. The failures are still surfaced as timeline warnings for visibility.

Changes:

  • Add an MSBuild target that emits warnings for failed Helix work items when explicitly enabled (WarnOnHelixTestFailure=true).
  • Introduce failOnTestFailures and warnOnTestFailures parameters in the Helix pipeline template and wire them to Helix SDK properties.
  • Update scheduled outerloop runs to set failOnTestFailures: false and warnOnTestFailures: true.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/libraries/sendtohelixhelp.proj` | Adds an `AfterTargets=CheckHelixJobStatus` target to surface failed work items as MSBuild warnings when opted in. |
| `eng/pipelines/libraries/outerloop.yml` | On scheduled runs only, disables failing on Helix failures and enables warning surfacing for all outerloop matrix legs. |
| `eng/pipelines/libraries/helix.yml` | Adds parameters to control Helix failure behavior and passes the corresponding MSBuild properties to `sendtohelix.proj`. |


mmitche requested a review from lewing June 26, 2026 20:22