Skip to content

[Test] Add integ-test for clustermgtd instance-ID matching under EC2 eventual consistency#7463

Open
hehe7318 wants to merge 1 commit into
aws:developfrom
hehe7318:wip/test-instance-id-matching
Open

[Test] Add integ-test for clustermgtd instance-ID matching under EC2 eventual consistency#7463
hehe7318 wants to merge 1 commit into
aws:developfrom
hehe7318:wip/test-instance-id-matching

Conversation

@hehe7318

Copy link
Copy Markdown
Contributor

Description of changes

Add test_clustermgtd_instance_id_matching, a regression test for the EC2 DescribeInstances eventual-consistency issue where instances are returned with a missing PrivateIpAddress. clustermgtd must keep such instances, match them to Slurm nodes by InstanceId, and leave the healthy nodes in place instead of replacing them by IP matching.

The fault is simulated on the head node by setup_missing_private_ip_override.sh, which wraps common.ec2_utils.get_private_ip_address_and_dns_name to raise KeyError('PrivateIpAddress') for a chosen instance, exactly as a real DescribeInstances response with a missing PrivateIpAddress would. The wrapper is toggled via a trigger file and gated so the fault can be turned on/off without reinstalling.

Also add an assertion in the scaling stress test to guard against the 'not all EC2 info are available' discard path reappearing at scale.

Tests

  • Describe the automated and/or manual tests executed to validate the patch.
  • Describe the added/modified tests.

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…eventual consistency

Add test_clustermgtd_instance_id_matching, a regression test for the EC2
DescribeInstances eventual-consistency issue where instances are returned
with a missing PrivateIpAddress. clustermgtd must keep such instances,
match them to Slurm nodes by InstanceId, and leave the healthy nodes in
place instead of replacing them by IP matching.

The fault is simulated on the head node by setup_missing_private_ip_override.sh,
which wraps common.ec2_utils.get_private_ip_address_and_dns_name to raise
KeyError('PrivateIpAddress') for a chosen instance, exactly as a real
DescribeInstances response with a missing PrivateIpAddress would. The
wrapper is toggled via a trigger file and gated so the fault can be turned
on/off without reinstalling.

Also add an assertion in the scaling stress test to guard against the
'not all EC2 info are available' discard path reappearing at scale.
@hehe7318 hehe7318 requested review from a team as code owners June 30, 2026 20:15
trigger_file = "/tmp/ec2_eventual_consistency_instances"

partition = "queue1"
num_static_nodes = 2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor/not-blocking] inject num_static_nodes into the cluster config as jinja variable so that you make sure that the number of nodes is always aligned. If in future we must change the number of nodes, we will only need to change that variable, rather than that variable + the config file.

str(test_datadir / "setup_missing_private_ip_override.sh"), run_as_root=True
)
remote_command_executor.run_remote_command(f"echo '{target_instance_id}' | sudo tee {trigger_file}")
remote_command_executor.run_remote_command("sudo systemctl restart supervisord")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restarting supervisord would restart all the daemons it manages.
This is not necessayr. Better to keep this restart focused and restart only clustermgtd instead.

remote_command_executor.run_remote_command(f"echo '{target_instance_id}' | sudo tee {trigger_file}")
remote_command_executor.run_remote_command("sudo systemctl restart supervisord")

retry(wait_fixed=seconds(20), stop_max_delay=minutes(5))(assert_lines_in_logs)(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A healthy restart would happen in seconds. Let's reduce stop_max_delay to 1 minute.
If it takes more than 1 minute to restart, then there is something wrong going on and we do not want it to be masked by a logn retry.

)

# Let clustermgtd run several more iterations; a regression would have replaced the node by now.
time.sleep(4 * loop_time)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 4 iterations? Is there some other evidence we can wait rather than using a magic number, e.g. a specific log line?

#
# The wrapper is module-global (also used by fleet_manager) and toggled via the trigger file, which only
# ever lists already-running instance ids. clustermgtd does not hot-reload python, so the caller must
# restart supervisord once after this script; toggling the trigger file afterwards needs no restart.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] it is not required to restart supervisord. it is only required to restart clustermgtd.

${MARKER}
import os as _ec_os # noqa: E402

_ec_original_get_private_ip_address_and_dns_name = get_private_ip_address_and_dns_name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does the _ec_ prefix mean?

return _ec_original_get_private_ip_address_and_dns_name(instance_info)
EOF

python3 -c "import ast, sys; ast.parse(open(sys.argv[1]).read())" "${tmp_file}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] We execute this line to validate that the resulting patched script is syntactically valid. That's a very good approach. Let's add a comment/echo describing what we are doing to clarify as it is not obvious.

remote_command_executor.clear_clustermgtd_log()

# Restart is required because clustermgtd does not hot-reload python code.
remote_command_executor.run_remote_script(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure that the script gives an evidence of its progress. If the script fails, we want from the test logs to be clear where and why

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants