POC: Time-based Helix test scheduling with AzDO history #54939

Draft
MichaelSimons wants to merge 13 commits into main from michaelsimons/helix-time-based-scheduler

Summary

Proof-of-concept replacing the SDK's count-based Helix test partitioning with time-based scheduling, inspired by dotnet/roslyn's AssemblyScheduler. Uses historical test execution times from Azure DevOps to create work items targeting ~10 minutes each, at the individual test method level.

What's Changed

New Files

  • test/HelixTasks/AzdoClient.cs — Lightweight AzDO REST client (builds + test results APIs)
  • test/HelixTasks/TestHistoryManager.cs — Fetches per-test-method duration history from last successful CI build, with fallback to main
  • test/HelixTasks/TestMethodDiscovery.cs — Discovers individual test methods from PE metadata via reflection
  • test/HelixTasks/TimeBasedScheduler.cs — Greedy first-fit bin-packing scheduler (10-min target per work item)
  • test/HelixTasks.SchedulerTool/ — Local console app for validating scheduling plans offline

Modified Files

  • test/HelixTasks/SDKCustomCreateXUnitWorkItemsWithTestExclusion.cs — Added UseTimeBasedScheduling mode with direct vstest.console.dll invocation via RSP files
  • test/xunit-runner/XUnitRunner.targets — Passes time-based scheduling properties to MSBuild task
  • test/UnitTests.proj — Auto-configures from AzDO pipeline variables, enables time-based scheduling by default

Design

  • Scheduling: Greedy first-fit bin-packing using historical execution times from AzDO REST API
  • Fallback: Count-based partitioning (25 work items) when no history is available
  • Test invocation: dotnet exec vstest.console.dll @workitem.rsp — all arguments (assembly, loggers, blame, filter) in a response file read natively by vstest.console.dll, eliminating all command-line length constraints
  • Windows: Uses .cmd batch scripts for correct variable expansion
  • Branch resolution: Queries AzDO for history on the PR target branch, falls back to main
  • Parallel execution: Temporarily disabled to stabilize test results and reduce noise from concurrency-related intermittent failures during validation
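
The scheduling and fallback bullets above can be sketched as follows. This is a minimal Python illustration, not the PR's C# `TimeBasedScheduler`: the names `WorkItem`, `schedule`, and `default_seconds` are hypothetical, and the round-robin fill used for the count-based fallback is an assumption.

```python
from dataclasses import dataclass, field

TARGET_SECONDS = 10 * 60   # ~10-minute budget per work item (from the PR description)
FALLBACK_WORK_ITEMS = 25   # count-based fallback size (from the PR description)


@dataclass
class WorkItem:
    tests: list = field(default_factory=list)
    total_seconds: float = 0.0


def schedule(tests, history, default_seconds=1.0):
    """Greedy first-fit: put each test in the first work item that has room."""
    if not history:
        # No history available: fall back to count-based partitioning.
        # Round-robin distribution is an assumption made for this sketch.
        count = min(FALLBACK_WORK_ITEMS, len(tests)) or 1
        items = [WorkItem() for _ in range(count)]
        for i, name in enumerate(tests):
            items[i % count].tests.append(name)
        return items

    items = []
    for name in tests:
        # Tests missing from history get an assumed default cost.
        cost = history.get(name, default_seconds)
        for item in items:
            if item.total_seconds + cost <= TARGET_SECONDS:
                break  # first work item with enough remaining budget
        else:
            item = WorkItem()  # nothing fits: open a new work item
            items.append(item)
        item.tests.append(name)
        item.total_seconds += cost
    return items


history = {"A": 400, "B": 300, "C": 200, "D": 700}
plan = schedule(["A", "B", "C", "D"], history)
print([item.tests for item in plan])  # [['A', 'C'], ['B'], ['D']]
```

A test that exceeds the budget on its own (`D` here) simply gets a dedicated work item. First-fit does not guarantee an optimal packing, which is the usual trade-off for a single cheap pass over the tests.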

MichaelSimons and others added 11 commits June 18, 2026 17:15
Adds time-based work item scheduling inspired by dotnet/roslyn's
AssemblyScheduler. Instead of partitioning by method count, this uses
historical test execution durations from Azure DevOps to create Helix
work items targeting ~10 minutes each at the individual test method level.

New files:
- AzdoClient.cs: Lightweight REST client for AzDO builds/test results API
- TestHistoryManager.cs: Fetches per-test duration history from last
  successful CI build, with branch fallback
- TestMethodDiscovery.cs: Discovers individual test methods from compiled
  assemblies using reflection metadata
- TimeBasedScheduler.cs: Greedy first-fit bin-packing scheduler with
  configurable target time, command-line length limits, and count-based
  fallback when history is unavailable
- HelixTasks.SchedulerTool/: Local console app for validating scheduling
  plans without running in CI

Modified:
- SDKCustomCreateXUnitWorkItemsWithTestExclusion.cs: Added
  UseTimeBasedScheduling mode with AzDO parameters, integrated
  time-based scheduling path alongside existing count-based approach
- HelixTasks.csproj: Added System.Text.Json, InternalsVisibleTo

The existing count-based scheduling is preserved as the default and
serves as fallback when history is unavailable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
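
As a rough illustration of the "last successful CI build, with branch fallback" lookup, the sketch below only builds the query URL for the Azure DevOps Builds list endpoint and the ordered list of branches to try. It is Python rather than the PR's C# `AzdoClient`; the function names, the example project URI, and the `api-version` value are assumptions, and no network call is made.

```python
from urllib.parse import urlencode


def candidate_branches(target_branch):
    """Branches to query in order: the PR target first, then main."""
    name = (target_branch or "").removeprefix("refs/heads/")
    return [name, "main"] if name and name != "main" else ["main"]


def latest_successful_build_url(project_uri, definition_id, branch):
    """URL asking for the most recent completed, succeeded build on a branch."""
    query = urlencode(
        {
            "definitions": definition_id,
            "branchName": f"refs/heads/{branch}",
            "statusFilter": "completed",
            "resultFilter": "succeeded",
            "$top": 1,
            "api-version": "7.1",  # assumed version for this sketch
        },
        safe="/$",  # keep '$top' and the ref path readable
    )
    return f"{project_uri.rstrip('/')}/_apis/build/builds?{query}"


# Hypothetical project URI and definition id, for illustration only.
for branch in candidate_branches("refs/heads/release/10.0"):
    print(latest_successful_build_url("https://dev.azure.com/org/proj", 42, branch))
```

A caller would request each URL in turn and stop at the first branch that returns a build, which mirrors the fallback to main described in the commit message.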
Update XUnitRunner.targets to pass all new time-based scheduling
properties to SDKCustomCreateXUnitWorkItemsWithTestExclusion.

Add auto-configuration in UnitTests.proj using AzDO built-in variables:
- AzdoProjectUri: derived from SYSTEM_COLLECTIONURI + SYSTEM_TEAMPROJECT
- AzdoAccessToken: from SYSTEM_ACCESSTOKEN (already mapped in sdk-build.yml)
- AzdoDefinitionId: from SYSTEM_DEFINITIONID
- AzdoTargetBranch: from SYSTEM_PULLREQUEST_TARGETBRANCH (falls back to main)

To enable: set UseTimeBasedScheduling=true in the pipeline or UnitTests.proj.
All other config is auto-derived from the pipeline environment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
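
The auto-configuration mapping above lives in MSBuild in the PR. The Python sketch below only restates the same derivation for clarity; the function name `azdo_settings` and the returned dictionary shape are hypothetical.

```python
import os


def azdo_settings(env=os.environ):
    """Derive the scheduling properties from AzDO built-in pipeline variables."""
    collection = env.get("SYSTEM_COLLECTIONURI", "")
    project = env.get("SYSTEM_TEAMPROJECT", "")
    project_uri = f"{collection.rstrip('/')}/{project}" if collection and project else ""
    return {
        "AzdoProjectUri": project_uri,
        "AzdoAccessToken": env.get("SYSTEM_ACCESSTOKEN", ""),
        "AzdoDefinitionId": env.get("SYSTEM_DEFINITIONID", ""),
        # Non-PR builds have no target branch, so fall back to main.
        "AzdoTargetBranch": env.get("SYSTEM_PULLREQUEST_TARGETBRANCH") or "main",
    }


# Example values are made up for illustration.
print(azdo_settings({
    "SYSTEM_COLLECTIONURI": "https://dev.azure.com/org/",
    "SYSTEM_TEAMPROJECT": "proj",
    "SYSTEM_DEFINITIONID": "42",
}))
```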
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The method-level filter strings (FullyQualifiedName per test) are much
longer than the old class-level filters. On Windows, cmd.exe has an
8191-character command line limit, so many work items were failing with
'The input line is too long' (exit code 255).

Fix: Make MaxFilterLength OS-aware:
- Windows: 7000 chars (leaving ~1200 for the command prefix)
- POSIX: 25000 chars (bash supports ~128KB+)

Also enforce the filter length limit in the count-based fallback path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
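
One way to picture the OS-aware limit from this commit is a helper that picks the cap per platform and splits filter terms so each joined filter stays under it. This is a Python sketch with hypothetical names (`max_filter_length`, `split_by_length`); only the 7000 and 25000 values come from the commit message.

```python
import sys


def max_filter_length(platform=sys.platform):
    """Filter-length cap: 7000 chars on Windows, 25000 elsewhere."""
    return 7000 if platform.startswith("win") else 25000


def split_by_length(terms, limit):
    """Group filter terms so each '|'-joined group is at most `limit` chars."""
    chunks, current, length = [], [], 0
    for term in terms:
        extra = len(term) + (1 if current else 0)  # +1 for the '|' separator
        if current and length + extra > limit:
            chunks.append(current)                 # close the full group
            current, length, extra = [], 0, len(term)
        current.append(term)
        length += extra
    if current:
        chunks.append(current)
    return chunks


groups = split_by_length(["a" * 10] * 5, limit=25)
print([len("|".join(g)) for g in groups])  # [21, 21, 10]
```

A single term longer than the limit still ends up in its own group, so a real implementation would need to decide how to handle that case.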
Instead of passing the method-level --filter on the command line (which
hits the 8191-char cmd.exe limit on Windows), write each work item's
filter to a .rsp response file in the publish directory and reference
it via @file.rsp on the command line.

This is the same approach used by dotnet/roslyn's Helix test runner.
The filter string can now be arbitrarily long, so work items are sized
purely by time budget (or count-based fallback), not constrained by
command-line length.

The TimeBasedScheduler's MaxFilterLength is now set to 100K (effectively
unlimited) since the rsp file has no length constraint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
With filters in response files, work item sizing is purely driven by
the time budget (or count for fallback). Remove all filter-length
tracking and the isPosixShell parameter from TimeBasedScheduler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of 'dotnet test @filter.rsp' (which expands the RSP and hits
the CreateProcess 32K limit), invoke vstest.console.dll directly:

  dotnet exec vstest.console.dll @workitem.rsp

The RSP file contains ALL arguments (assembly, loggers, blame, filter)
and vstest.console.dll reads it natively without spawning a child
process — completely eliminating any command-line length constraint.

This matches the approach used by dotnet/roslyn's Helix test runner.

MTP projects continue to use dotnet exec with the test assembly directly
since they already handle arguments without the CreateProcess issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
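
A sketch of what generating such a response file could look like is shown below. It is Python, not the PR's C# task; the helper names, the one-argument-per-line layout, and the quoting are assumptions about how the file might be written, and the option spellings follow the documented vstest.console switches.

```python
from pathlib import Path


def write_rsp(path, assembly, test_filter, results_dir):
    """Write all vstest.console arguments for one work item to a response file."""
    lines = [
        f'"{assembly}"',
        "/Logger:trx",
        "/Blame",
        f'/ResultsDirectory:"{results_dir}"',
        # The filter can be far longer than any command-line limit,
        # because it never appears on the command line itself.
        f'/TestCaseFilter:"{test_filter}"',
    ]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def helix_command(rsp_name):
    """The short command Helix runs; everything else lives in the file."""
    return f"dotnet exec vstest.console.dll @{rsp_name}"


print(helix_command("workitem.rsp"))  # dotnet exec vstest.console.dll @workitem.rsp
```

Because the command line only carries the `@workitem.rsp` reference, its length stays constant no matter how many test methods the work item contains.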
cmd.exe expands %variables% at parse time, so 'set /p var=<file&&
dotnet exec %var%' expands %var% to empty string before set runs.

Fix: write a .cmd batch script to the payload directory where each line
is parsed independently. The Helix command is just the script filename.
POSIX continues to use inline commands since shell expansions are evaluated at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xUnit: set parallelizeAssembly and parallelizeTestCollections to false
MSTest: set MSTestParallelizeWorkers to 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Setting MSTestParallelizeWorkers=1 still causes MSTest targets to inject
[Parallelize], which conflicts with [DoNotParallelize] attributes in
several test projects. Setting scope to None prevents the attribute from
being generated entirely, and is compatible with projects that already
set MSTestParallelizeScope=None locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MichaelSimons and others added 2 commits June 23, 2026 16:07
On Windows Helix machines, DOTNET_ROOT may point to a system-installed
.NET SDK with an incompatible (older) vstest.console.dll. This caused
MissingMethodException crashes in all non-MTP test work items.

Use HELIX_CORRELATION_PAYLOAD/d instead, which always contains the
custom-built SDK matching the test assemblies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
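
The lookup this commit describes might be sketched as below. It is a Python illustration with a hypothetical function name, and it assumes the standard SDK layout in which `vstest.console.dll` sits under `sdk/<version>/` inside the `d` directory of the correlation payload.

```python
import os
from pathlib import Path


def find_vstest_console(env=os.environ):
    """Locate vstest.console.dll in the custom-built SDK, ignoring DOTNET_ROOT."""
    payload = env.get("HELIX_CORRELATION_PAYLOAD")
    if not payload:
        return None
    # Assumed layout: <payload>/d/sdk/<version>/vstest.console.dll
    matches = sorted((Path(payload) / "d" / "sdk").glob("*/vstest.console.dll"))
    return matches[-1] if matches else None
```

Resolving the path from the payload rather than from `DOTNET_ROOT` keeps the runner and the test assemblies on the same SDK build, which is the mismatch the commit message identifies as the cause of the `MissingMethodException` crashes.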
Bare method names with exact-match filter missed [Theory]/[InlineData]
test cases whose FQN includes parameters (e.g. Method(arg1, arg2)).
Using 'FullyQualifiedName~Method' (contains) ensures all parameterized
variants are matched, resolving ~2,800 missing tests per leg.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
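
The difference between the two filter operators can be shown with a toy model. The Python sketch below is not the PR's code: `build_filter`, `exact`, and `contains` are hypothetical helpers, and the test names are invented.

```python
def build_filter(method_names):
    """Join per-method 'contains' terms with the filter OR operator."""
    return "|".join(f"FullyQualifiedName~{name}" for name in method_names)


def exact(fqn, value):
    """Toy model of FullyQualifiedName=value."""
    return fqn == value


def contains(fqn, value):
    """Toy model of FullyQualifiedName~value."""
    return value in fqn


theory_case = "Ns.Tests.Add(a: 1, b: 2)"    # FQN of a parameterized test case
print(exact(theory_case, "Ns.Tests.Add"))     # False: the parameters break equality
print(contains(theory_case, "Ns.Tests.Add"))  # True
print(build_filter(["Ns.Tests.Add", "Ns.Tests.Sub"]))
```

One consequence worth keeping in mind is that a contains match is broader than an exact one: a term for `Ns.Tests.Add` would also select a method named `Ns.Tests.AddMany`, so two work items could end up running the same test.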