[Test] Add integ-test for clustermgtd instance-ID matching under EC2 eventual consistency by hehe7318 · Pull Request #7463 · aws/aws-parallelcluster

hehe7318 · 2026-06-30T20:15:34Z

Description of changes

Add test_clustermgtd_instance_id_matching, a regression test for the EC2 DescribeInstances eventual-consistency issue where instances are returned with a missing PrivateIpAddress. clustermgtd must keep such instances, match them to Slurm nodes by InstanceId, and leave the healthy nodes in place instead of replacing them by IP matching.

The fault is simulated on the head node by setup_missing_private_ip_override.sh, which wraps common.ec2_utils.get_private_ip_address_and_dns_name to raise KeyError('PrivateIpAddress') for a chosen instance, exactly as a real DescribeInstances response with a missing PrivateIpAddress would. The wrapper is toggled via a trigger file and gated so the fault can be turned on/off without reinstalling.

Also add an assertion in the scaling stress test to guard against the 'not all EC2 info are available' discard path reappearing at scale.

Tests

Describe the automated and/or manual tests executed to validate the patch.
Describe the added/modified tests.

References

Link to impacted open issues.
Link to related PRs in other packages (i.e. cookbook, node).
Link to documentation useful to understand the changes.

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…eventual consistency Add test_clustermgtd_instance_id_matching, a regression test for the EC2 DescribeInstances eventual-consistency issue where instances are returned with a missing PrivateIpAddress. clustermgtd must keep such instances, match them to Slurm nodes by InstanceId, and leave the healthy nodes in place instead of replacing them by IP matching. The fault is simulated on the head node by setup_missing_private_ip_override.sh, which wraps common.ec2_utils.get_private_ip_address_and_dns_name to raise KeyError('PrivateIpAddress') for a chosen instance, exactly as a real DescribeInstances response with a missing PrivateIpAddress would. The wrapper is toggled via a trigger file and gated so the fault can be turned on/off without reinstalling. Also add an assertion in the scaling stress test to guard against the 'not all EC2 info are available' discard path reappearing at scale.

gmarciani · 2026-07-01T21:28:45Z

+    trigger_file = "/tmp/ec2_eventual_consistency_instances"
+
+    partition = "queue1"
+    num_static_nodes = 2


[minor/not-blocking] inject num_static_nodes into the cluster config as jinja variable so that you make sure that the number of nodes is always aligned. If in future we must change the number of nodes, we will only need to change that variable, rather than that variable + the config file.

gmarciani · 2026-07-01T21:31:59Z

+        str(test_datadir / "setup_missing_private_ip_override.sh"), run_as_root=True
+    )
+    remote_command_executor.run_remote_command(f"echo '{target_instance_id}' | sudo tee {trigger_file}")
+    remote_command_executor.run_remote_command("sudo systemctl restart supervisord")


Restarting supervisord would restart all the daemons it manages.
This is not necessayr. Better to keep this restart focused and restart only clustermgtd instead.

gmarciani · 2026-07-01T21:33:36Z

+    remote_command_executor.run_remote_command(f"echo '{target_instance_id}' | sudo tee {trigger_file}")
+    remote_command_executor.run_remote_command("sudo systemctl restart supervisord")
+
+    retry(wait_fixed=seconds(20), stop_max_delay=minutes(5))(assert_lines_in_logs)(


A healthy restart would happen in seconds. Let's reduce stop_max_delay to 1 minute.
If it takes more than 1 minute to restart, then there is something wrong going on and we do not want it to be masked by a logn retry.

gmarciani · 2026-07-01T21:35:57Z

+    )
+
+    # Let clustermgtd run several more iterations; a regression would have replaced the node by now.
+    time.sleep(4 * loop_time)


Why 4 iterations? Is there some other evidence we can wait rather than using a magic number, e.g. a specific log line?

gmarciani · 2026-07-01T21:42:45Z

+#
+# The wrapper is module-global (also used by fleet_manager) and toggled via the trigger file, which only
+# ever lists already-running instance ids. clustermgtd does not hot-reload python, so the caller must
+# restart supervisord once after this script; toggling the trigger file afterwards needs no restart.


[minor] it is not required to restart supervisord. it is only required to restart clustermgtd.

gmarciani · 2026-07-01T21:44:14Z

+${MARKER}
+import os as _ec_os  # noqa: E402
+
+_ec_original_get_private_ip_address_and_dns_name = get_private_ip_address_and_dns_name


what does the _ec_ prefix mean?

gmarciani · 2026-07-01T21:51:29Z

+    return _ec_original_get_private_ip_address_and_dns_name(instance_info)
+EOF
+
+python3 -c "import ast, sys; ast.parse(open(sys.argv[1]).read())" "${tmp_file}"


[minor] We execute this line to validate that the resulting patched script is syntactically valid. That's a very good approach. Let's add a comment/echo describing what we are doing to clarify as it is not obvious.

gmarciani · 2026-07-01T21:53:18Z

+    remote_command_executor.clear_clustermgtd_log()
+
+    # Restart is required because clustermgtd does not hot-reload python code.
+    remote_command_executor.run_remote_script(


Let's make sure that the script gives an evidence of its progress. If the script fails, we want from the test logs to be clear where and why

hehe7318 requested review from a team as code owners June 30, 2026 20:15

gmarciani reviewed Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Test] Add integ-test for clustermgtd instance-ID matching under EC2 eventual consistency#7463

[Test] Add integ-test for clustermgtd instance-ID matching under EC2 eventual consistency#7463
hehe7318 wants to merge 1 commit into
aws:developfrom
hehe7318:wip/test-instance-id-matching

hehe7318 commented Jun 30, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

gmarciani Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

hehe7318 commented Jun 30, 2026

Description of changes

Tests

References

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants