@dpsarmie dpsarmie commented Nov 12, 2025

Commit Queue Requirements:

  • This PR addresses a relevant WM issue (if not, create an issue).
  • All subcomponent pull requests (if any) have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines), preferably on Ursa (Derecho or Hercules are acceptable alternatives). Exceptions: documentation-only PRs, CI-only PRs, etc.
    • Commit log file w/full results from RT suite run (if applicable).
    • Verify that test_changes.list indicates which tests, if any, are changed by this PR. Commit test_changes.list, even if it is empty.
  • Fill out all sections of this template.

Description:

#2979

This PR adds a configurable option, EXCLUSIVE_NODES, that requests exclusive node access from PBS or SLURM when running regression tests.
Setting EXCLUSIVE_NODES to true or false controls whether the corresponding directive is added to the PBS or SLURM job card.
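
As a rough illustration of the mapping described above, here is a minimal Python sketch of turning an EXCLUSIVE_NODES setting into a job-card line. The function names are hypothetical, and the directive spellings (`#SBATCH --exclusive` for SLURM, `#PBS -l place=excl` for PBS) are assumptions about what the job-card templates use; the actual RT scripts in this PR may differ.

```python
def parse_flag(value: str) -> bool:
    """Interpret an EXCLUSIVE_NODES setting given as 'true' or 'false'."""
    normalized = value.strip().lower()
    if normalized not in ("true", "false"):
        raise ValueError(f"EXCLUSIVE_NODES must be true or false, got {value!r}")
    return normalized == "true"


def exclusive_directive(scheduler: str, exclusive_nodes: bool) -> str:
    """Return the job-card line requesting exclusive nodes, or '' if disabled."""
    # Assumed directive spellings; the real templates may use different ones.
    directives = {
        "slurm": "#SBATCH --exclusive",
        "pbs": "#PBS -l place=excl",
    }
    if not exclusive_nodes:
        return ""
    key = scheduler.lower()
    if key not in directives:
        raise ValueError(f"unsupported scheduler: {scheduler!r}")
    return directives[key]
```

For example, `exclusive_directive("slurm", parse_flag("true"))` yields the SLURM line, while a false setting yields an empty string so that nothing is added to the job card.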

For now, Ursa-specific checks are added to certain tests. Machines that already always run with the exclusive option turned on will continue to do so.

This is based on work by Dusan, who found that the conus13km and regional regression tests ran to completion more often when the node running the test was not also running other jobs on its unused cores (i.e., when the exclusive option was used). This did not fully resolve the issue, but it is a step in the right direction.

An option in SLURM for consecutive nodes was also added. This was found to speed up some runs in GW (https://github.com/NOAA-EMC/global-workflow/pull/4123/files#diff-f0b943b553ef1734f72f6660f02fcf8d692d324e6cf64ad490f002aa6b9bcd12L12). It won't affect most tests, since they run on a single node and are not resource intensive, but it should have an effect on RTs or machines that check out multiple nodes.
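
To make the combination concrete, the sketch below assembles the node-placement lines of a SLURM job card. It assumes the consecutive-nodes option corresponds to sbatch's `--contiguous` flag, which this description does not state explicitly; the function name and parameters are hypothetical.

```python
def slurm_node_options(nodes: int, exclusive: bool, contiguous: bool) -> list[str]:
    """Build the node-related #SBATCH lines for a job card."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    lines = [f"#SBATCH --nodes={nodes}"]
    if exclusive:
        # Keep other jobs off the allocated nodes.
        lines.append("#SBATCH --exclusive")
    if contiguous and nodes > 1:
        # Assumed flag for consecutive nodes; it has no effect on a
        # single-node job, so it is only emitted for multi-node requests.
        lines.append("#SBATCH --contiguous")
    return lines
```

Skipping the flag for single-node jobs mirrors the point above that most tests run on one node and would not be affected either way.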

#2992

Re-enables the broken per-timestep restarts for ATM.

#3000

This PR addresses multiple compiler warnings and updates WW3 to the current develop branch.

A full set of Intel and GNU UFSWM RT tests was performed. Changes were observed in three tests, all due to run-time timeouts.
NOTE: These changes do not involve the wave model.

regional_control intel
regional_restart intel
regional_2dwrtdecomp intel

Commit Message:

* UFSWM - Add exclusive node option
* UFSWM - updates WW3 and resolves compiler warnings
  * UFSATM - fix per-timestep restarts
  * WW3 - Addresses compiler warnings and updates WW3 to current develop branch

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

UFSWM Blocking Dependencies:

  • None

Documentation:

  • Documentation update NOT required.

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • GaeaC6
    • Derecho
    • Ursa
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@dpsarmie dpsarmie marked this pull request as ready for review November 13, 2025 20:40
@dpsarmie dpsarmie added the No Baseline Change label Nov 13, 2025
@dpsarmie dpsarmie marked this pull request as draft November 13, 2025 20:54
@dpsarmie
Collaborator Author

I still want to do more testing on the MSU machines before this is marked as ready for the queue.

@gspetro-NOAA gspetro-NOAA moved this from Evaluating to Draft in PRs to Process Nov 14, 2025
@dpsarmie
Collaborator Author

dpsarmie commented Nov 18, 2025

We've had occasional issues on Hercules with long (4+ hour) runtimes for the full regression test suite, which is why I wanted to do more testing.
It looks like this could be alleviated by turning off the exclusive node option on Hercules. However, doing so caused the hrrr_control_2threads_dyn32_phy32 test to fail on rare occasions due to timeouts.

I'm going to leave the settings as is for Hercules. If the long runtimes on Hercules continue, we can turn that option off by default and turn it on only for specific tests when running on Hercules (as this PR does for Ursa).
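
The fallback described above (a per-machine default with per-test overrides) could look roughly like the following sketch. The tables and function name are hypothetical, and the entries are illustrative only: they show the possible future Hercules arrangement, not the settings this PR actually ships.

```python
# Hypothetical per-machine defaults for the exclusive node option.
MACHINE_DEFAULTS = {
    "hercules": False,  # illustrative: option turned off by default
}

# Hypothetical per-test overrides, keyed by (machine, test name).
TEST_OVERRIDES = {
    ("hercules", "hrrr_control_2threads_dyn32_phy32"): True,
}


def use_exclusive(machine: str, test_name: str) -> bool:
    """Decide whether a test should request exclusive nodes on a machine."""
    key = (machine.lower(), test_name)
    if key in TEST_OVERRIDES:
        # A test-specific setting wins over the machine default.
        return TEST_OVERRIDES[key]
    # Machines without an entry fall back to non-exclusive in this sketch.
    return MACHINE_DEFAULTS.get(machine.lower(), False)
```

With these tables, `use_exclusive("hercules", "hrrr_control_2threads_dyn32_phy32")` is true, while any other test on Hercules falls back to the machine default.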

Feel free to comment or make other suggestions.

@dpsarmie dpsarmie marked this pull request as ready for review November 18, 2025 16:30
@gspetro-NOAA gspetro-NOAA moved this from Draft to Evaluating in PRs to Process Nov 18, 2025
@gspetro-NOAA gspetro-NOAA moved this from Evaluating to Review in PRs to Process Nov 18, 2025
@gspetro-NOAA gspetro-NOAA moved this from Review to Schedule in PRs to Process Nov 19, 2025
@gspetro-NOAA
Collaborator

@dpsarmie What do you think about combining #3000 and/or #2992 into this PR? We were thinking that if we can get Dusan's PR processed today and start one more PR before the holiday, yours is a fairly straightforward one. We could do it alone or combine with one of these two.

@dpsarmie
Collaborator Author

Sure, I can get both of those into this PR and have it ready by the time Dusan's PR is processed.

@gspetro-NOAA gspetro-NOAA changed the title from "Add exclusive SLURM option to select RTs on Ursa" to "Add exclusive SLURM option to select RTs on Ursa // fix per-timestep restarts for ATM #2992 // Fix compilation warnings and update WW3 #3000" Dec 2, 2025
@gspetro-NOAA gspetro-NOAA added the UFSATM and WW3 labels Dec 2, 2025
@gspetro-NOAA gspetro-NOAA added the On Deck label Dec 9, 2025
Labels

  • No Baseline Change
  • On Deck: This is the next PR in line for testing/merge.
  • UFSATM: There are changes to the UFSATM repository.
  • WW3: There are changes to the WW3 component repository.

Projects

Status: Schedule

Development

Successfully merging this pull request may close these issues.

  • ATM restarts every timestep not working?
  • Add an exclusive node option when submitting RTs

3 participants