[Test] Add integ-test for clustermgtd instance-ID matching under EC2 eventual consistency#7463
[Test] Add integ-test for clustermgtd instance-ID matching under EC2 eventual consistency#7463hehe7318 wants to merge 1 commit into
Conversation
…eventual consistency
Add test_clustermgtd_instance_id_matching, a regression test for the EC2
DescribeInstances eventual-consistency issue where instances are returned
with a missing PrivateIpAddress. clustermgtd must keep such instances,
match them to Slurm nodes by InstanceId, and leave the healthy nodes in
place instead of replacing them by IP matching.
The fault is simulated on the head node by setup_missing_private_ip_override.sh,
which wraps common.ec2_utils.get_private_ip_address_and_dns_name to raise
KeyError('PrivateIpAddress') for a chosen instance, exactly as a real
DescribeInstances response with a missing PrivateIpAddress would. The
wrapper is toggled via a trigger file and gated so the fault can be turned
on/off without reinstalling.
Also add an assertion in the scaling stress test to guard against the
'not all EC2 info are available' discard path reappearing at scale.
| trigger_file = "/tmp/ec2_eventual_consistency_instances" | ||
|
|
||
| partition = "queue1" | ||
| num_static_nodes = 2 |
There was a problem hiding this comment.
[minor/not-blocking] inject num_static_nodes into the cluster config as jinja variable so that you make sure that the number of nodes is always aligned. If in future we must change the number of nodes, we will only need to change that variable, rather than that variable + the config file.
| str(test_datadir / "setup_missing_private_ip_override.sh"), run_as_root=True | ||
| ) | ||
| remote_command_executor.run_remote_command(f"echo '{target_instance_id}' | sudo tee {trigger_file}") | ||
| remote_command_executor.run_remote_command("sudo systemctl restart supervisord") |
There was a problem hiding this comment.
Restarting supervisord would restart all the daemons it manages.
This is not necessayr. Better to keep this restart focused and restart only clustermgtd instead.
| remote_command_executor.run_remote_command(f"echo '{target_instance_id}' | sudo tee {trigger_file}") | ||
| remote_command_executor.run_remote_command("sudo systemctl restart supervisord") | ||
|
|
||
| retry(wait_fixed=seconds(20), stop_max_delay=minutes(5))(assert_lines_in_logs)( |
There was a problem hiding this comment.
A healthy restart would happen in seconds. Let's reduce stop_max_delay to 1 minute.
If it takes more than 1 minute to restart, then there is something wrong going on and we do not want it to be masked by a logn retry.
| ) | ||
|
|
||
| # Let clustermgtd run several more iterations; a regression would have replaced the node by now. | ||
| time.sleep(4 * loop_time) |
There was a problem hiding this comment.
Why 4 iterations? Is there some other evidence we can wait rather than using a magic number, e.g. a specific log line?
| # | ||
| # The wrapper is module-global (also used by fleet_manager) and toggled via the trigger file, which only | ||
| # ever lists already-running instance ids. clustermgtd does not hot-reload python, so the caller must | ||
| # restart supervisord once after this script; toggling the trigger file afterwards needs no restart. |
There was a problem hiding this comment.
[minor] it is not required to restart supervisord. it is only required to restart clustermgtd.
| ${MARKER} | ||
| import os as _ec_os # noqa: E402 | ||
|
|
||
| _ec_original_get_private_ip_address_and_dns_name = get_private_ip_address_and_dns_name |
There was a problem hiding this comment.
what does the _ec_ prefix mean?
| return _ec_original_get_private_ip_address_and_dns_name(instance_info) | ||
| EOF | ||
|
|
||
| python3 -c "import ast, sys; ast.parse(open(sys.argv[1]).read())" "${tmp_file}" |
There was a problem hiding this comment.
[minor] We execute this line to validate that the resulting patched script is syntactically valid. That's a very good approach. Let's add a comment/echo describing what we are doing to clarify as it is not obvious.
| remote_command_executor.clear_clustermgtd_log() | ||
|
|
||
| # Restart is required because clustermgtd does not hot-reload python code. | ||
| remote_command_executor.run_remote_script( |
There was a problem hiding this comment.
Let's make sure that the script gives an evidence of its progress. If the script fails, we want from the test logs to be clear where and why
Description of changes
Add test_clustermgtd_instance_id_matching, a regression test for the EC2 DescribeInstances eventual-consistency issue where instances are returned with a missing PrivateIpAddress. clustermgtd must keep such instances, match them to Slurm nodes by InstanceId, and leave the healthy nodes in place instead of replacing them by IP matching.
The fault is simulated on the head node by setup_missing_private_ip_override.sh, which wraps common.ec2_utils.get_private_ip_address_and_dns_name to raise KeyError('PrivateIpAddress') for a chosen instance, exactly as a real DescribeInstances response with a missing PrivateIpAddress would. The wrapper is toggled via a trigger file and gated so the fault can be turned on/off without reinstalling.
Also add an assertion in the scaling stress test to guard against the 'not all EC2 info are available' discard path reappearing at scale.
Tests
References
Checklist
developadd the branch name as prefix in the PR title (e.g.[release-3.6]).Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.