@jgehrcke
Collaborator

Add specification for current numNodes behavior as well as guidance towards non-reliance on the CD-global Status field.

As we will remove this soon, it doesn't need to be perfect. But I'd like us to get into the habit of having a complete and correct specification of all parameters in our API surface; that will then be the reference for code, and spec patches are some of the most interesting ones.

Resolves #615.

@copy-pr-bot
copy-pr-bot bot commented Sep 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 24, 2025
@jgehrcke jgehrcke added this to the v25.8.0 milestone Sep 24, 2025
@klueska klueska added the documentation Issue/PR focused on fixing/editing/adding documentation bits label Sep 24, 2025
@jgehrcke jgehrcke added the API issue/pr related to API (changes, specification, ...) label Sep 24, 2025
Comment on lines 64 to 65
// ComputeDomain. Must be zero or greater. Recommended to be set to zero:
// workload must implement and consult its own source of truth for the
Collaborator

We should only recommend setting numNodes to 0 when IMEXDaemonsWithDNSNames=true. Setting it to 0 when IMEXDaemonsWithDNSNames=false will result in IMEX daemons being brought online without first filling out their nodes_config.cfg files with all intended IPs of the ComputeDomain. That was the whole point of the gating in the first place.

Collaborator Author

@jgehrcke jgehrcke Sep 25, 2025

> only recommend setting numNodes to 0 when IMEXDaemonsWithDNSNames=true.

Of course. Will make an update.

This really is clear to me -- the fact that I forgot to make that case distinction explicitly is a reflection of how enthusiastic I am about leaving the "legacy world" behind us ASAP: precisely to not have to think through everything from two very different perspectives for a long period of time.

Collaborator

@klueska klueska Sep 25, 2025

Probably also worth saying that it must be equal to the expected number of workers joining the ComputeDomain when IMEXDaemonsWithDNSNames=false.
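
The two cases discussed in this thread can be sketched in Go roughly as follows. This is only an illustration of the rule as stated in the comments above, not code from the driver: `checkNumNodes`, `recommendedNumNodes`, and the `expectedWorkers` parameter are hypothetical names, and the `dnsNames` flag merely stands in for the IMEXDaemonsWithDNSNames feature gate.

```go
package main

import (
	"errors"
	"fmt"
)

// checkNumNodes is a hypothetical helper (not part of the driver's API).
// dnsNames stands in for the IMEXDaemonsWithDNSNames feature gate;
// expectedWorkers is the number of workers expected to join the ComputeDomain.
func checkNumNodes(numNodes int, dnsNames bool, expectedWorkers int) error {
	// The spec under review says: must be zero or greater.
	if numNodes < 0 {
		return errors.New("numNodes must be zero or greater")
	}
	// Legacy mode: per the thread, numNodes must match the expected number
	// of workers, because the IMEX daemons are gated on it.
	if !dnsNames && numNodes != expectedWorkers {
		return fmt.Errorf(
			"numNodes is %d but must equal the expected worker count %d when IMEXDaemonsWithDNSNames=false",
			numNodes, expectedWorkers)
	}
	return nil
}

// recommendedNumNodes returns the value this thread recommends:
// zero with DNS names enabled, otherwise the expected worker count.
func recommendedNumNodes(dnsNames bool, expectedWorkers int) int {
	if dnsNames {
		return 0
	}
	return expectedWorkers
}

func main() {
	fmt.Println(recommendedNumNodes(true, 4))
	fmt.Println(recommendedNumNodes(false, 4))
	fmt.Println(checkNumNodes(0, false, 4))
}
```

With the gate enabled, the sketch accepts any non-negative value and only recommends zero; with the gate disabled, zero is rejected unless no workers are expected.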

Collaborator Author

@jgehrcke jgehrcke Sep 25, 2025

I have pushed another update. I like this discussion + iteration.

@jgehrcke jgehrcke force-pushed the jp/numnodes-spec branch 4 times, most recently from 2e4971c to 5a0112c Compare September 25, 2025 18:01
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
@klueska klueska merged commit 49764c3 into NVIDIA:main Oct 6, 2025
10 of 11 checks passed
@klueska klueska moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Oct 6, 2025

Successfully merging this pull request may close these issues.

Add specs around setting numNodes=0 when IMEXDaemonsWithDNSNames=true

2 participants