[integ-tests-framework] Make capacity reservations for all instance types #7461

Merged
hanwen-cluster merged 3 commits into aws:develop from hanwen-cluster:developjun29 on Jul 1, 2026

hanwen-cluster (Contributor) commented Jun 29, 2026

Description of changes

  1. In #7440 ([integ-tests] Improve test_proxy to avoid insufficient capacity error), we started making capacity reservations for {"c5.xlarge", "m6g.xlarge", "m6i.xlarge"} and falling back to other similar instance types if a capacity reservation fails to be created. This commit extends that logic to all instance types.
    1.1. For instance types <= .xlarge, we make duplicate capacity reservations because multiple tests running in parallel could use the same instance type and therefore need multiple capacity reservations. For instance types > .xlarge, we make only one capacity reservation because tests with larger instance types usually make capacity reservations early in the test definition (e.g. test_efa in commercial makes its capacity reservation in develop.yaml), so this second layer of capacity reservation shouldn't create duplicates.
    1.2. For instance types that support EFA, create the capacity reservation in a placement group. For instance types that do not support EFA, create the capacity reservation without a placement group.
  2. With this commit, resolve_instance_with_capacity allows the caller to specify alternative_instance_types. Prior to this commit, alternative_instance_types was always calculated with get_similar_instance_types, which can be too restrictive and yields few alternatives for instance types like c5n.18xlarge.
  3. Improve test_efa in isolated_regions to accept a flag that allows any EFA instance type, to avoid Insufficient Capacity Errors. test_efa in commercial doesn't need this because it can try different regions. In isolated regions, the test has to run in a specific region.
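
The sizing and placement-group rules in 1.1 and 1.2 can be sketched roughly as follows. This is an illustration only, not the code in capacity_helpers.py: `is_at_most_xlarge`, `plan_reservation`, and the `duplicates_for_small` default of 2 are hypothetical names and values chosen for the example, and the caller is assumed to already know whether the instance type supports EFA.

```python
# Hypothetical sketch of the reservation policy described in 1.1 and 1.2.
# Sizes up to and including "xlarge"; anything else (2xlarge, 18xlarge, metal, ...)
# is treated as larger than .xlarge.
_SIZES_UP_TO_XLARGE = {"nano", "micro", "small", "medium", "large", "xlarge"}


def is_at_most_xlarge(instance_type: str) -> bool:
    """Return True if the size suffix of e.g. 'c5.xlarge' is <= xlarge."""
    _, size = instance_type.split(".", 1)
    return size in _SIZES_UP_TO_XLARGE


def plan_reservation(instance_type: str, supports_efa: bool, duplicates_for_small: int = 2) -> dict:
    """Decide how many reservations to make and whether to use a placement group.

    duplicates_for_small is an assumed value; the PR does not state the actual count.
    """
    count = duplicates_for_small if is_at_most_xlarge(instance_type) else 1
    return {
        "instance_type": instance_type,
        "reservation_count": count,
        "use_placement_group": supports_efa,
    }


small_plan = plan_reservation("c5.xlarge", supports_efa=False)
large_plan = plan_reservation("c5n.18xlarge", supports_efa=True)
```

Under these assumptions, `small_plan` asks for more than one reservation without a placement group, while `large_plan` asks for a single reservation inside a placement group.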

Tests

test-suites:
  efa:
    test_efa.py::test_efa:
      dimensions:
        - regions: ["ap-southeast-5"]
          instances: ["c5n.18xlarge"]
          oss: ["alinux2023"]
          schedulers: ["slurm"]
          flags: ["any-efa-instances"]
        - regions: ["us-east-1"]
          instances: ["c5n.18xlarge"]
          oss: ["alinux2023"]
          schedulers: ["slurm"]

In the tests above, the us-east-1 run passed completely. The ap-southeast-5 run failed some fabtest checks because it ran on g6.8xlarge. This failure is not a regression from this PR, and it won't surface in isolated regions because fabtest is not run there.
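
The ap-southeast-5 run is an example of the fallback path: the requested type was not used and an alternative EFA instance type was chosen instead. A minimal sketch of that kind of fallback is shown below. It is not the signature of resolve_instance_with_capacity, which this page does not show; `resolve_with_fallback`, `try_reserve`, and the availability set are hypothetical and exist only for the illustration.

```python
from typing import Callable, Iterable, Optional


def resolve_with_fallback(
    requested: str,
    alternatives: Iterable[str],
    try_reserve: Callable[[str], bool],
) -> Optional[str]:
    """Try the requested instance type first, then each alternative in order.

    try_reserve is a stand-in for whatever creates a capacity reservation and
    reports success; it is an assumption of this sketch.
    """
    candidates = [requested] + [alt for alt in alternatives if alt != requested]
    for candidate in candidates:
        if try_reserve(candidate):
            return candidate
    return None  # no capacity for any candidate


# Made-up availability for the example: only g6.8xlarge can be reserved.
available = {"g6.8xlarge"}
chosen = resolve_with_fallback(
    "c5n.18xlarge",
    ["p4d.24xlarge", "g6.8xlarge"],
    lambda instance_type: instance_type in available,
)
```

Passing the alternatives in explicitly, rather than always deriving them from the requested type, is what lets a caller widen the candidate list to any EFA-capable type.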

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

hanwen-cluster requested review from a team as code owners June 29, 2026 21:05
hanwen-cluster added the skip-changelog-update label (Disables the check that enforces changelog updates in PRs) Jun 29, 2026
Comment thread tests/integration-tests/tests/common/capacity_helpers.py
return False


def get_efa_instance_types(region):

hehe7318 (Contributor) commented Jun 30, 2026

As previously discussed, let's add a filter here to restrict ARM instance types.

Contributor Author

Done
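
One way such a filter could look is sketched below, assuming "restrict" here means excluding arm64 instance types from the EFA candidates. The function name `filter_x86_efa_instance_types` is hypothetical and this is not the code that was merged. The records follow the shape of entries returned by the EC2 DescribeInstanceTypes API (`InstanceType`, `ProcessorInfo.SupportedArchitectures`, `NetworkInfo.EfaSupported`), and the sample records are illustrative rather than fetched from AWS.

```python
def filter_x86_efa_instance_types(instance_type_records):
    """Keep EFA-capable instance types that are x86_64 and not arm64.

    Each record is expected to look like an entry of the "InstanceTypes" list
    in an EC2 DescribeInstanceTypes response.
    """
    selected = []
    for record in instance_type_records:
        architectures = record.get("ProcessorInfo", {}).get("SupportedArchitectures", [])
        efa_supported = record.get("NetworkInfo", {}).get("EfaSupported", False)
        if efa_supported and "x86_64" in architectures and "arm64" not in architectures:
            selected.append(record["InstanceType"])
    return sorted(selected)


# Illustrative records only; real values come from the EC2 API.
sample_records = [
    {"InstanceType": "c5n.18xlarge",
     "ProcessorInfo": {"SupportedArchitectures": ["x86_64"]},
     "NetworkInfo": {"EfaSupported": True}},
    {"InstanceType": "c7gn.16xlarge",
     "ProcessorInfo": {"SupportedArchitectures": ["arm64"]},
     "NetworkInfo": {"EfaSupported": True}},
    {"InstanceType": "c5.xlarge",
     "ProcessorInfo": {"SupportedArchitectures": ["x86_64"]},
     "NetworkInfo": {"EfaSupported": False}},
]
x86_efa_types = filter_x86_efa_instance_types(sample_records)
```

With the sample records, only the x86_64 EFA-capable entry remains in `x86_efa_types`.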

hanwen-cluster enabled auto-merge (rebase) July 1, 2026 13:10
hanwen-cluster merged commit 723ebd6 into aws:develop Jul 1, 2026
19 checks passed
Labels

skip-changelog-update Disables the check that enforces changelog updates in PRs

2 participants