Skip to content

Run scontrol as a standalone subprocess and parse its output in Python#709

Merged
hanwen-cluster merged 1 commit into
aws:developfrom
hanwen-cluster:al2vsal2023
Jun 22, 2026
Merged

Run scontrol as a standalone subprocess and parse its output in Python#709
hanwen-cluster merged 1 commit into
aws:developfrom
hanwen-cluster:al2vsal2023

Conversation

@hanwen-cluster

@hanwen-cluster hanwen-cluster commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Description of changes

get_nodes_info, get_partitions_info, and get_slurm_reservations_info previously invoked scontrol through a shell pipeline (scontrol ... | awk ... | grep ...). When the pipeline failed, only the last command's exit code was reported, so a scontrol failure and a parsing failure were indistinguishable.

This change runs scontrol on its own and does the field extraction and parsing in Python:

Add _run_scontrol_command, which runs scontrol as a single subprocess (no shell), capturing stdout, stderr, and the exit code separately and logging them. This replaces the awk record splitting and grep -oP field filtering. _parse_nodes_info, _parse_reservations_info, and _parse_partitions_info now consume the list of extracted records get_partitions_info now uses scontrol show partitions (the -o oneliner is no longer needed, since records are split on blank lines like the node and reservation commands).

  • shell=True is removed from these paths, which is also a security improvement; awk and grep are no longer required at runtime.
  • Partition filtering is now an exact match on the partition name (the previous grep -e "PartitionName=" also matched names by substring).

Tests

  • Unit tests updated and added to cover scontrol exiting non-zero with and without output, timeouts, empty output, a parsing failure surfacing, and the record extraction for nodes, partitions, and reservations.
  • The following integration tests have been passed
{%- import 'common.jinja2' as common with context -%}
{{- common.OSS_COMMERCIAL_X86.append("rocky8") or "" -}}
{{- common.OSS_COMMERCIAL_X86.append("rocky9") or "" -}}
---
test-suites:
  schedulers:
    test_slurm.py::test_slurm:
      dimensions:
        - regions: ["eu-central-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_0 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_from_login_nodes_in_private_network:
      dimensions:
        - regions: ["eu-west-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_2 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_scaling:
      dimensions:
        - regions: ["us-east-2"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_4 }}]
          schedulers: [ "slurm" ]
    test_slurm.py::test_error_handling:
      dimensions:
        - regions: ["eu-west-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_6 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_protected_mode:
      dimensions:
        - regions: ["ca-central-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_1 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_protected_mode_on_cluster_create:
      dimensions:
        - regions: ["ap-east-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_3 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_fast_capacity_failover:
      dimensions:
        - regions: ["ap-east-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_5 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_expedited_requeue:
      dimensions:
        - regions: ["ap-east-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_5 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_config_update:
      dimensions:
        - regions: [ "ap-east-1" ]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_6 }}]
          schedulers: [ "slurm" ]
    test_slurm.py::test_slurm_memory_based_scheduling:
      dimensions:
        - regions: ["ap-east-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_0 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_scontrol_reboot:
      dimensions:
        - regions: ["us-east-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_2 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_scontrol_reboot_ec2_health_checks:
      dimensions:
        - regions: ["us-east-2"]
          instances: ["t3.medium"]
          oss: [{{ OS_X86_4 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_overrides:
      dimensions:
        - regions: ["us-west-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_6 }}]
          schedulers: ["slurm"]
    test_slurm_accounting.py::test_slurm_accounting:
      dimensions:
        - regions: ["ap-south-1"]
          instances:  {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_1 }}]
          schedulers: ["slurm"]
    test_slurm_accounting.py::test_slurm_accounting_external_dbd:
      dimensions:
        - regions: [ "ap-south-1" ]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_3 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_custom_config_parameters:
      dimensions:
        - regions: ["ap-southeast-3"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_5 }}]
          schedulers: ["slurm"]
    test_slurm.py::test_slurm_custom_partitions:
      dimensions:
        - regions: ["ap-northeast-2"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_6 }}]
          schedulers: ["slurm"]
    test_custom_munge_key.py::test_custom_munge_key:
      dimensions:
        - regions: ["eu-west-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_0 }}]
          schedulers: ["slurm"]
    test_slurm_rest_api.py::test_slurm_rest_api:
      dimensions:
        - regions: [ "eu-north-1" ]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_0 }}]
          schedulers: [ "slurm" ]

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners June 19, 2026 18:23
@hanwen-cluster hanwen-cluster added skip-changelog-update skip-security-exclusions-check Skip the checks regarding the security exclusions labels Jun 19, 2026
stderr,
)
if raise_on_error:
raise subprocess.CalledProcessError(result.returncode, command, output=stdout, stderr=result.stderr)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Block]

def _run_command(command_function, command, env=None, raise_on_error=True, execute_as_user=None, log_error=True):
try:
if env is None:
env = {}
env.update(os.environ.copy())
if execute_as_user:
log.debug("Executing command as user '%s': %s", execute_as_user, command)
pw_record = pwd.getpwnam(execute_as_user)
user_uid = pw_record.pw_uid
user_gid = pw_record.pw_gid
preexec_fn = _demote(user_uid, user_gid)
return command_function(command, env, preexec_fn)
else:
log.debug("Executing command: %s", command)
return command_function(command, env, None)
except subprocess.CalledProcessError as e:
# CalledProcessError.__str__ already produces a significant error message
if raise_on_error:
if log_error:
log.error(e)
raise
else:
if log_error:
log.warning(e)
return e
except OSError as e:
log.error("Unable to execute the command %s. Failed with exception: %s", command, e)
raise

In the old _run_command it also catches OSError, can we add the except OSError to align?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Added in line 386 - 388

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@hehe7318 hehe7318 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good PR, only 1 blocking comment, others LGTM.

`get_nodes_info`, `get_partitions_info`, and `get_slurm_reservations_info` previously invoked `scontrol` through a shell pipeline (`scontrol ... | awk ... | grep ...`). When the pipeline failed, only the last command's exit code was reported, so a `scontrol` failure and a parsing failure were indistinguishable.

This change runs scontrol on its own and does the field extraction and parsing in Python:

Add `_run_scontrol_command`, which runs scontrol as a single subprocess (no shell), capturing stdout, stderr, and the exit code separately and logging them. This replaces the awk record splitting and grep -oP field filtering.
`_parse_nodes_info`, `_parse_reservations_info`, and `_parse_partitions_info` now consume the list of extracted records
`get_partitions_info` now uses scontrol show partitions (the `-o` oneliner is no longer needed, since records are split on blank lines like the node and reservation commands).

* `shell=True` is removed from these paths, which is also a security improvement; `awk` and `grep` are no longer required at runtime.
* Partition filtering is now an exact match on the partition name (the previous grep -e "PartitionName=<name>" also matched names by substring).
@hanwen-cluster hanwen-cluster merged commit 00356a4 into aws:develop Jun 22, 2026
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog-update skip-security-exclusions-check Skip the checks regarding the security exclusions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants