[Node] Reset inactive partition nodes on fleet start to ensure recovering from protected mode #707
Changes from all commits
bfaff17
f3b3f08
c33aa5d
a167842
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,7 +19,11 @@ | |
| from logging.config import fileConfig | ||
|
|
||
| from botocore.config import Config | ||
| -from common.schedulers.slurm_commands import resume_powering_down_nodes, update_all_partitions | ||
| +from common.schedulers.slurm_commands import ( | ||
| +    reset_nodes_in_inactive_partitions, | ||
| +    resume_powering_down_nodes, | ||
| +    update_all_partitions, | ||
| +) | ||
| from slurm_plugin.clustermgtd import ComputeFleetStatus, ComputeFleetStatusManager | ||
| from slurm_plugin.common import log_exception | ||
| from slurm_plugin.instance_manager import InstanceManager | ||
|
|
@@ -91,6 +95,7 @@ def _manage_fleet_status_transition(config, computefleet_status_data_path): | |
|
|
||
| def _start_partitions(): | ||
|     log.info("Setting slurm partitions to UP and resuming nodes...") | ||
| +    reset_nodes_in_inactive_partitions() | ||
|     update_all_partitions(PartitionStatus.UP, reset_node_addrs_hostname=False) | ||
|
Contributor
We explicitly set
Contributor
Author
So not a contradiction.
||
| # TODO: This function was added due to Slurm ticket 12915. The bug is not reproducible and the ticket was then | ||
| # closed. This operation may now be useless: we need to check this. | ||
|
|
||
Could you please add a log line saying that there are no nodes in inactive partitions to reset?
I know it may seem minor, but considering that we are adding this logic to fix a race condition, it may be useful for troubleshooting to know exactly what the daemon sees when it checks.
Done!
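For readers following the thread, here is a rough sketch of the kind of logging requested above. It is not the code merged in this PR: the function name `reset_nodes_in_inactive_partitions_sketch`, the `partitions` mapping, and the injected `reset_nodes` callable are all hypothetical, chosen so the example runs without a Slurm installation. It only assumes that `INACTIVE` is the partition state of interest.

```python
import logging

log = logging.getLogger(__name__)


def reset_nodes_in_inactive_partitions_sketch(partitions, reset_nodes):
    """Hypothetical sketch, not the actual aws-parallelcluster-node implementation.

    partitions: mapping of partition name -> (state, list of node names)
    reset_nodes: callable that receives the list of node names to reset
    Returns the sorted list of node names that were passed to reset_nodes.
    """
    # Collect the unique nodes that belong to at least one INACTIVE partition.
    nodes_to_reset = sorted(
        {
            node
            for state, nodes in partitions.values()
            if state == "INACTIVE"
            for node in nodes
        }
    )
    if not nodes_to_reset:
        # Log the empty case too, so troubleshooting shows what the daemon saw.
        log.info("No nodes in INACTIVE partitions to reset.")
        return []
    log.info("Resetting nodes in INACTIVE partitions: %s", ",".join(nodes_to_reset))
    reset_nodes(nodes_to_reset)
    return nodes_to_reset
```

Logging both branches means a fleet-start log always records whether a reset was attempted, which is the information the reviewer asked for when debugging the race condition.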