Merged
35 commits
0b1ada9
Bugfix: inconsistently setting PMIX_JOB_RECOVERABLE
Matthew-Whitlock Oct 9, 2025
f4972be
Merge pull request #85 from openpmix/master
hppritcha Oct 10, 2025
ac77387
Clarify help messages
cniethammer Oct 20, 2025
8451015
Update with new log2 and tool_connected2 upcalls
rhc54 Oct 23, 2025
191216d
Fix the definition checks for tool_connected2 and log2 integration
rhc54 Oct 23, 2025
c806b92
Merge pull request #88 from openpmix/master
hppritcha Oct 24, 2025
1afde4c
Check only for existence of PMIx capability flag
rhc54 Oct 24, 2025
c8d098a
Add some minor verbose debug output
rhc54 Oct 24, 2025
04ff9fe
Merge pull request #90 from openpmix/master
hppritcha Oct 25, 2025
16d8412
Do not assign DVM's bookmark to the application job
rhc54 Oct 26, 2025
73729ee
Merge pull request #91 from openpmix/master
hppritcha Oct 27, 2025
8994252
Fix printout of binding cpus
rhc54 Oct 27, 2025
f92d91b
Merge pull request #92 from openpmix/master
hppritcha Oct 28, 2025
665c38e
Error out when asymmetric topologies cannot support ppr requests
rhc54 Oct 30, 2025
b972f05
Merge pull request #94 from openpmix/master
hppritcha Oct 30, 2025
cb17cce
Let seq and rankfile mappers compute their own num-procs
rhc54 Oct 30, 2025
58130c6
Fix relative node processing
rhc54 Oct 30, 2025
2ff7d6b
Replace sprintf with snprintf
rhc54 Oct 31, 2025
d615ba0
Merge pull request #95 from openpmix/master
hppritcha Oct 31, 2025
d072f27
Extend timeout to child jobs
rhc54 Nov 2, 2025
424480d
Add launching-apps section to docs
rhc54 Nov 3, 2025
db94eb7
Merge pull request #96 from openpmix/master
hppritcha Nov 3, 2025
5307ee3
Fix map-by pe-list when using core CPUs
rhc54 Nov 4, 2025
761e618
Merge pull request #97 from openpmix/master
hppritcha Nov 5, 2025
e0fbe06
Allow PMIx group construct caller to specify the order of the final m…
rhc54 Nov 5, 2025
7c82568
Merge pull request #98 from openpmix/master
hppritcha Nov 6, 2025
9cc43ea
Correct show-help message
rhc54 Sep 29, 2025
1259ec2
Merge pull request #99 from openpmix/master
hppritcha Nov 7, 2025
7e5d030
Improve hetero node detection a bit
rhc54 Nov 7, 2025
e6378c6
Merge pull request #100 from openpmix/master
hppritcha Nov 8, 2025
2845dcd
Tweak the forwarding of signals
rhc54 Nov 8, 2025
4671290
Cleanup and improve autohandling of hetero nodes
rhc54 Nov 7, 2025
bff13fb
Fix prun tool
rhc54 Nov 9, 2025
807ad1a
Merge pull request #101 from openpmix/master
hppritcha Nov 10, 2025
6715d14
Merge remote-tracking branch 'origin/master' into ompi_main
hppritcha Nov 12, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -198,6 +198,7 @@ test/iostress
test/spawn_multiple
test/clichk
test/chkfs
test/spawn_timeout

test/mpi/spawn_multiple
test/mpi/create_comm_from_group
15 changes: 11 additions & 4 deletions config/prte_setup_pmix.m4
@@ -39,10 +39,7 @@ AC_DEFUN([PRTE_CHECK_PMIX_CAP],[
#error This PMIx does not have any capability flags
#endif
#if !defined(PMIX_CAP_$1)
-#error This PMIx does not have the PMIX_CAP_$1 capability flag at all
-#endif
-#if (PMIX_CAPABILITIES & PMIX_CAP_$1) == 0
-#error This PMIx does not have the PMIX_CAP_$1 capability flag set
+#error This PMIx does not have the PMIX_CAP_$1 capability flag
#endif
]
)
@@ -183,6 +180,16 @@ AC_DEFUN([PRTE_CHECK_PMIX],[
PRTE_FLAGS_APPEND_UNIQ(LDFLAGS, $PRTE_FINAL_LDFLAGS)
PRTE_FLAGS_APPEND_UNIQ(LIBS, $PRTE_FINAL_LIBS)

AC_MSG_CHECKING([for support of version 2 server upcalls])
PRTE_CHECK_PMIX_CAP([UPCALLS2],
[AC_MSG_RESULT([yes])
prte_server2_upcalls=1],
[AC_MSG_RESULT([no])
prte_server2_upcalls=0])
AC_DEFINE_UNQUOTED([PRTE_PMIX_SERVER2_UPCALLS],
[$prte_server2_upcalls],
[Whether or not PMIx supports server2 upcalls])

AC_MSG_CHECKING([for in-memory show-help content compatibility])
PRTE_CHECK_PMIX_CAP([INMEMHELP],
[AC_MSG_RESULT([yes])],
3 changes: 2 additions & 1 deletion docs/Makefile.am
@@ -2,7 +2,7 @@
# Copyright (c) 2022-2023 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2023 Jeffrey M. Squyres. All rights reserved.
#
-# Copyright (c) 2023-2024 Nanook Consulting All rights reserved.
+# Copyright (c) 2023-2025 Nanook Consulting All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -39,6 +39,7 @@ RST_SOURCE_FILES = \
$(srcdir)/prrte-rst-content/*.rst \
$(srcdir)/placement/*.rst \
$(srcdir)/hosts/*.rst \
$(srcdir)/launching-apps/*.rst \
$(srcdir)/how-things-work/*.rst \
$(srcdir)/how-things-work/schedulers/*.rst \
$(srcdir)/developers/*.rst \
1 change: 1 addition & 0 deletions docs/index.rst
@@ -35,6 +35,7 @@ Table of contents
how-things-work/index
hosts/index
placement/index
launching-apps/index
notifications
session-directory
developers/index
293 changes: 293 additions & 0 deletions docs/launching-apps/gridengine.rst
@@ -0,0 +1,293 @@
Launching with Grid Engine
==========================

PRRTE supports the Grid Engine family of schedulers, including Sun
Grid Engine (SGE), Oracle Grid Engine (OGE), Grid Engine (GE), Son of
Grid Engine, Open Cluster Scheduler (OCS), Gridware Cluster Scheduler (GCS),
and others.

This documentation refers to all of them collectively as "Grid
Engine" unless it is referring to a specific flavor of the Grid Engine
family.

Verify Grid Engine support
--------------------------

.. important:: To build Grid Engine support in PRRTE, you must
explicitly request it with the ``--with-sge``
command line switch to PRRTE's ``configure`` script.

To verify that support for Grid Engine is configured into your PRRTE
installation, run ``prte_info`` as shown below and look for
``gridengine``.

.. code-block::

shell$ prte_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
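
The check above can be scripted, for example in a site health check. The following is a minimal sketch, not part of PRRTE: the helper name ``has_gridengine_support`` is hypothetical, and it only assumes that ``prte_info`` prints a line containing ``gridengine`` when the component was built, as in the output shown above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with PRRTE): succeeds if the
# prte_info-style text on stdin mentions a gridengine component.
has_gridengine_support() {
    grep -q 'gridengine'
}

# Only query a real installation when prte_info is on the PATH.
if command -v prte_info >/dev/null 2>&1; then
    if prte_info | has_gridengine_support; then
        echo "Grid Engine support found"
    else
        echo "Grid Engine support missing; reconfigure with --with-sge" >&2
    fi
fi
```

Reading from stdin keeps the helper usable on saved ``prte_info`` output as well as on a live installation.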


Launching
---------

When Grid Engine support is included, PRRTE will automatically
detect when it is running inside a Grid Engine job and will just "do the Right
Thing."

Specifically, if you execute a ``prterun`` command in a Grid Engine
job, it will automatically use the Grid Engine mechanisms to launch
and kill processes. There is no need to specify what nodes to run on
|mdash| PRRTE will obtain this information directly from Grid
Engine and default to a number of processes equal to the slot count
specified. For example, this will run 4 application processes on the nodes
that were allocated by Grid Engine:

.. code-block:: sh

# Get the environment variables for Grid Engine

# (Assuming Grid Engine is installed at /opt/sge and $SGE_CELL
# is 'default' in your environment)
shell$ . /opt/sge/default/common/settings.sh

# Allocate a Grid Engine interactive job with 4 slots from a
# parallel environment (PE) named 'foo' and run a 4-process job
shell$ qrsh -pe foo 4 -b y prterun -n 4 mpi-hello-world

There are also other ways to submit jobs under Grid Engine:

.. code-block:: sh

# Submit a batch job with the 'prterun' command embedded in a script
shell$ qsub -pe foo 4 my_prterun_job.csh

# Submit a Grid Engine job and launch the application with prterun in one line
shell$ qrsh -V -pe foo 4 prterun hostname

# Use qstat(1) to show the status of Grid Engine jobs and queues
shell$ qstat -f

Make sure that you have a Parallel Environment (PE) defined for
submitting parallel jobs. You don't have to name your
PE "foo". The following example shows what a PE named "foo" might
look like:

.. code-block::

shell$ qconf -sp foo
pe_name foo
slots 99999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
qsort_args NONE

.. note:: ``qsort_args`` is necessary with the Son of Grid Engine
distribution, version 8.1.1 and later, and probably only applicable
to it.

.. note:: For very old versions of Sun Grid Engine, omit
``accounting_summary`` too.

.. note:: For Open Cluster Scheduler / Gridware Cluster Scheduler it is
necessary to set ``ign_sreq_on_mhost`` (ignoring slave resource requests
on the master node) to ``FALSE``.

You may want to alter other parameters, but the important one is
``control_slaves``, specifying that the environment has "tight
integration". Note also the lack of a start or stop procedure. The
tight integration means that ``prterun`` automatically picks up the slot
count to use as a default in place of the ``-n`` argument, picks up a
host file, spawns remote processes via ``qrsh`` so that Grid Engine
can control and monitor them, and creates and destroys a per-job
temporary directory (``$TMPDIR``), in which PRRTE's directory will
be created (by default).
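
To see the slot count and host file that tight integration relies on, you can inspect them from inside a job script. The sketch below assumes the standard Grid Engine job environment variables ``NSLOTS`` and ``PE_HOSTFILE``, and the usual host file layout of one ``hostname slots queue processor-range`` line per host; the helper name ``count_pe_slots`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: sum the slot counts (second column) of a
# Grid Engine PE host file given as the first argument.
# Assumed line format: hostname slots queue processor-range
count_pe_slots() {
    awk 'NF >= 2 { total += $2 } END { print total + 0 }' "$1"
}

# Inside a Grid Engine job, the scheduler sets both variables.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "NSLOTS=${NSLOTS:-unset}"
    echo "slots listed in host file: $(count_pe_slots "$PE_HOSTFILE")"
fi
```

If the two numbers differ, the PE configuration is worth checking before you debug the launch itself.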

Be sure the queue will make use of the PE that you specified:

.. code-block::

shell$ qconf -sq all.q
[...snipped...]
pe_list make cre foo
[...snipped...]
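
The same check can be automated. This is a rough sketch under the assumption that ``qconf -sq`` prints the PE names on a single ``pe_list`` line, as in the output above; long lists that Grid Engine wraps across continuation lines are not handled. The helper name ``queue_has_pe`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the pe_list line of `qconf -sq`
# output on stdin contains the PE named by the first argument.
queue_has_pe() {
    awk -v pe="$1" '
        $1 == "pe_list" {
            for (i = 2; i <= NF; i++) if ($i == pe) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Only query a real cluster when qconf is on the PATH.
if command -v qconf >/dev/null 2>&1; then
    if qconf -sq all.q | queue_has_pe foo; then
        echo "queue all.q offers PE foo"
    else
        echo "queue all.q does not list PE foo" >&2
    fi
fi
```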

To determine whether the Grid Engine parallel job is successfully
launched to the remote nodes, you can pass in the MCA parameter
``--prtemca plm_base_verbose 1`` to ``prterun``.

This will add in a ``-verbose`` flag to the ``qrsh -inherit`` command
that is used to send parallel tasks to the remote Grid Engine
execution hosts. It will show whether the connections to the remote
hosts are established successfully or not.

Grid Engine documentation, with pointers to further material, used to be
available at `the Son of GridEngine site <http://arc.liv.ac.uk/sge/>`_, and
configuration instructions at `the Son of GridEngine
configuration how-to site
<http://arc.liv.ac.uk/SGE/howto/sge-configs.html>`_. These links may no
longer work.

An open source successor of Sun Grid Engine that is actively developed
(as of 2024 and 2025) is
`Open Cluster Scheduler <https://github.com/hpc-gridware/clusterscheduler>`_.
It maintains backward compatibility with SGE and provides many new features.
An MPI parallel environment setup for Open MPI is available in
`the Open Cluster Scheduler GitHub repository
<https://github.com/hpc-gridware/clusterscheduler/tree/master/source/dist/mpi/openmpi>`_.

Grid Engine tight integration support of the ``qsub -notify`` flag
------------------------------------------------------------------

If you are running SGE 6.2 Update 3 or later, then the ``-notify``
flag is supported. If you are running earlier versions, then the
``-notify`` flag will not work and using it will cause the job to be
killed.

Using ``-notify`` requires some care. First, let us review what
``-notify`` does. Here is an excerpt from the qsub man page for the
``-notify`` flag:

The ``-notify`` flag, when set, causes Sun Grid Engine to send
warning signals to a running job prior to sending the signals
themselves. If a SIGSTOP is pending, the job will receive a SIGUSR1
several seconds before the SIGSTOP. If a SIGKILL is pending, the
job will receive a SIGUSR2 several seconds before the SIGKILL. The
amount of time delay is controlled by the notify parameter in each
queue configuration.

Let us assume the reason you want to use the ``-notify`` flag is to
get the SIGUSR1 signal prior to getting the SIGTSTP signal. PRRTE forwards
some signals by default, but others need to be specifically requested.
The following MCA parameter controls this behavior:

.. code-block::

prte_ess_base_forward_signals: Comma-delimited list of additional signals (names or integers) to forward to
application processes ["none" => forward nothing]. Signals provided by
default include SIGTSTP, SIGUSR1, SIGUSR2, SIGABRT, SIGALRM, and SIGCONT

Within that constraint, something like this batch script can be used:

.. code-block:: sh

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe foo 16
#$ -j y
#$ -l h_rt=00:20:00
prterun -n 16 mpi-hello-world

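However, one of two changes has to be made to this script for things
to work properly. By default, a SIGUSR1 signal will kill a shell
script, so we have to make sure that does not happen. One way
to handle it is to ``exec`` the ``prterun`` command, which replaces the
shell with ``prterun`` so that no shell is left to be killed: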

.. code-block:: sh

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe foo 16
#$ -j y
#$ -l h_rt=00:20:00
exec prterun -n 16 mpi-hello-world

Alternatively, one can catch the signals in the script instead of doing
an ``exec`` of ``prterun``:

.. code-block:: sh

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe foo 16
#$ -j y
#$ -l h_rt=00:20:00

function sigusr1handler()
{
echo "SIGUSR1 caught by shell script" 1>&2
}

function sigusr2handler()
{
echo "SIGUSR2 caught by shell script" 1>&2
}

trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2

prterun -n 16 mpi-hello-world
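
The effect of the ``trap`` lines can be checked without Grid Engine or PRRTE. The following is a small, self-contained sketch: the function name ``survive_sigusr1`` is hypothetical, and a child ``sh`` stands in for the batch script. The child installs a handler, sends SIGUSR1 to itself, and then keeps running.

```shell
#!/bin/sh
# Hypothetical demonstration: a child shell traps SIGUSR1, signals
# itself, and continues past the signal instead of being killed.
survive_sigusr1() {
    sh -c '
        trap "echo SIGUSR1 caught by shell script >&2" USR1
        kill -USR1 $$      # $$ is the PID of this child shell
        echo survived      # reached only because the signal was trapped
    '
}

survive_sigusr1
```

Removing the ``trap`` line from the child shell leaves SIGUSR1 with its default action, which terminates the shell before it reaches the final ``echo``.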

Grid Engine job suspend / resume support
----------------------------------------

To suspend the job, send a SIGTSTP (not SIGSTOP) signal to
``prterun``. ``prterun`` will catch this signal and forward it to the
``mpi-hello-world`` processes as a SIGSTOP signal. To resume the job, send
a SIGCONT signal to ``prterun``, which will catch it and forward it to
the ``mpi-hello-world`` processes.

Here is an example on Solaris:

.. code-block:: sh

shell$ prterun -n 2 mpi-hello-world

In another window, we suspend and continue the job:

.. code-block:: sh

shell$ prstat -p 15301,15303,15305
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
15305 rolfv 158M 22M cpu1 0 0 0:00:21 5.9% mpi-hello-world/1
15303 rolfv 158M 22M cpu2 0 0 0:00:21 5.9% mpi-hello-world/1
15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% mpirun/1

shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
15303 rolfv 158M 22M stop 30 0 0:01:44 21% mpi-hello-world/1
15305 rolfv 158M 22M stop 20 0 0:01:44 21% mpi-hello-world/1
15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% mpirun/1

shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
15305 rolfv 158M 22M cpu1 0 0 0:02:06 17% mpi-hello-world/1
15303 rolfv 158M 22M cpu3 0 0 0:02:06 17% mpi-hello-world/1
15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% mpirun/1

Note that all this does is stop the ``mpi-hello-world`` processes. It
does not, for example, free any pinned memory when the job is in the
suspended state.

To get this to work under the Grid Engine environment, you have to
change the ``suspend_method`` entry in the queue configuration. It has to be set to
SIGTSTP. Here is an example of what the queue should look like:

.. code-block:: sh

shell$ qconf -sq all.q
qname all.q
[...snipped...]
starter_method NONE
suspend_method SIGTSTP
resume_method NONE

Note that if you need to suspend other types of jobs with SIGSTOP
(instead of SIGTSTP) in this queue, then you need to provide a script
that sends the correct signal for each job type.
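
One possible shape for such a script is sketched below. It is not taken from PRRTE or Grid Engine. It assumes that ``suspend_method`` points at a script that is passed the job name and job PID (Grid Engine offers the ``$job_name`` and ``$job_pid`` pseudo variables for this), and it uses a purely hypothetical site convention that PRRTE jobs have names starting with ``prte_``. The function names are also hypothetical.

```shell
#!/bin/sh
# Hypothetical convention: jobs named "prte_*" are launched with
# prterun and must be suspended with SIGTSTP; all others get SIGSTOP.
pick_suspend_signal() {
    case "$1" in
        prte_*) echo TSTP ;;
        *)      echo STOP ;;
    esac
}

# Send the chosen signal to the job.
#   $1 = job name, $2 = job PID
suspend_job() {
    kill -s "$(pick_suspend_signal "$1")" "$2"
}

# A real suspend_method script would end by calling:
#   suspend_job "$@"
# with the queue configured along the lines of:
#   suspend_method /path/to/this/script $job_name $job_pid
```

A real deployment may need to signal the whole process group rather than a single PID, depending on how the jobs are started.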