Skip to content

Set MPI-executable for ParFlow to $SLURM_SRUN when Slurm#90

Merged
kvrigor merged 3 commits into
masterfrom
srun-better-fix-mvh
Jun 4, 2026
Merged

Set MPI-executable for ParFlow to $SLURM_SRUN when Slurm#90
kvrigor merged 3 commits into
masterfrom
srun-better-fix-mvh

Conversation

@mvhulten
Copy link
Copy Markdown
Contributor

This is needed when GCC+OpenMPI is used. This is tested on JURECA.

With Intel+ParaStationMPI this already went right; there is no effect when compiling with Intel LLVM. It only has effect if ${SLURM_SRUN} is set, which means it is a Slurm system (as is already assumed).

Resolves: #87
Replaces: #88

This is needed when GCC+OpenMPI is used.  This is tested on JURECA.

With Intel+ParaStationMPI this already went right; there is no effect
when compiling with Intel LLVM.  It only has effect if ${SLURM_SRUN} is
set, which means it is a Slurm system (as is already assumed).

Resolves: #87
@mvhulten
Copy link
Copy Markdown
Contributor Author

The only additional assumption, I can think of, is that some system where srun is found, requires the usage of something different from srun. I find this highly unlikely, but in case certainty is needed, someone should dig into where MPIEXEC_EXECUTABLE is originally set (I couldn't find it).

@mvhulten mvhulten requested a review from kvrigor August 18, 2025 13:45
@mvhulten
Copy link
Copy Markdown
Contributor Author

Only here I found a case where it was set. This is for a specific machine (same for two other NERSC machines). Instead of if(SLURM_SRUN) we could check if we are on a JSC machine, but I was hessitant to introduce a machine check to BuildParFlow.cmake, especially if my solution here may be OK.

@mvhulten mvhulten marked this pull request as draft August 22, 2025 14:23
@mvhulten
Copy link
Copy Markdown
Contributor Author

mvhulten commented Aug 22, 2025

It's not okay. On Marvin it fails with this message:

--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------

whereas master gives no problems on Marvin. Options

  1. Do a machine ("if on JSC") check (add to build code)
  2. Ask Marvin admins to provide OMPI with Slurm's PMI support (waiting for response)
  3. ...

WDYT?

@kvrigor
Copy link
Copy Markdown
Member

kvrigor commented Aug 25, 2025

How about this solution:

  1. Add export MPIEXEC_EXECUTABLE=/path/to/srun in environment file
  2. Modify cmake/FindMPIFortran.cmake:
    • Use the MPIEXEC_EXECUTABLE environment setting
    • If no MPIEXEC_EXECUTABLE has been set, use the autodetected MPIEXEC_EXECUTABLE

@mvhulten
Copy link
Copy Markdown
Contributor Author

I think I need to

if(JSC)
  set(MPIEXEC_EXECUTABLE "${SLURM_SRUN}")
endif()

in cmake/BuildParFlow.cmake, because Marvin does not appear to be able to use srun (with some PMI API, even though srun --mpi=list lists pmi2). I'm getting verification from my Marvin admin colleagues about this…

How do I properly check if I'm on JSC hardware?

@kvrigor
Copy link
Copy Markdown
Member

kvrigor commented Aug 26, 2025

I think I need to

if(JSC)
  set(MPIEXEC_EXECUTABLE "${SLURM_SRUN}")
endif()

in cmake/BuildParFlow.cmake, because Marvin does not appear to be able to use srun (with some PMI API, even though srun --mpi=list lists pmi2). I'm getting verification from my Marvin admin colleagues about this…

How do I properly check if I'm on JSC hardware?

It's best to keep the CMake scripts as system-agnostic as possible. All machine-specific settings should be configured thru the environment files (that's why they exist). In this case you can set MPIEXEC_EXECUTABLE in default.2025.env or on the env files themselves, and then let cmake/FindMPIFortran.cmake detect the user-defined MPIEXEC_EXECUTABLE. This way we nicely limit the scope of TSMP2 CMake scripts to compile-related options.

@mvhulten
Copy link
Copy Markdown
Contributor Author

I've done this in 30bfbcb.

But I cannot reproduce the original problem anymore. I went back to master and did a clean compile and several of the parallel Tcl scripts from the test directory all pass. I'm on the same old commits for TSMP2 and parflow. I'll look at it later again.

@mvhulten
Copy link
Copy Markdown
Contributor Author

mvhulten commented Jun 3, 2026

This or some variant (based on above comments) should be merged if it helps on Marvin and does not break running elsewhere especially JSC. If this is not the case or difficult, please close this PR.

Jan Martin and Gayatri could be involved.

@kvrigor
Copy link
Copy Markdown
Member

kvrigor commented Jun 3, 2026

@mvhulten the change looks harmless to me; moreover having the MPIEXEC_EXECUTABLE as an env variable could be potentially helpful to our CMake scripts. Do you mind replicating this change to all other env files for consistency?

EDIT: Also @mvhulten you can remove the "draft" status of this PR if you wish this to be reviewed and merged soon.. otherwise I would just assume that this PR is not ready for review yet

@mvhulten
Copy link
Copy Markdown
Contributor Author

mvhulten commented Jun 3, 2026

Will do, I'm testing changes.

Set this in machine environment file and do not set everywhere to $SLURM_SRUN.
On JSC machines srun is used, but on Marvin mpirun must be used.  The latter
requirement may relate to the missing or incomplete support of PMIx on Marvin.
@mvhulten mvhulten marked this pull request as ready for review June 3, 2026 13:31
@mvhulten
Copy link
Copy Markdown
Contributor Author

mvhulten commented Jun 3, 2026

Runtime is not tested on Marvin, but we know that running with srun on Marvin does not work, so this will not make it worse.

Please squash and merge with the commit message of the latest commit.

@kvrigor kvrigor merged commit 2d5bfeb into master Jun 4, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi-process ParFlow stand-alone fails on GNU stack

2 participants