@klueska klueska commented Jan 9, 2025

No description provided.

@klueska klueska force-pushed the add-multi-node-crd branch 10 times, most recently from 38065b4 to 3e51cd8 Compare January 14, 2025 08:15
@klueska klueska force-pushed the add-multi-node-crd branch 20 times, most recently from 4ce9bdb to 0d435d8 Compare January 22, 2025 15:56
For now just mark one well-known error as permanent. Future commits will
abstract this better and mark more errors as permanent.

Signed-off-by: Kevin Klues <[email protected]>
@klueska klueska force-pushed the add-multi-node-crd branch 2 times, most recently from df90001 to 764c3c7 Compare February 19, 2025 15:43
// ComputeDomainSpec provides the spec for a ComputeDomain.
type ComputeDomainSpec struct {
	NumNodes int                       `json:"numNodes"`
	Channel  *ComputeDomainChannelSpec `json:"channel"`
Collaborator

Should `channel` be optional, with non-IMEX use cases in mind for the future? I know we are currently focused solely on IMEX support, but if we want to carry the ComputeDomain concept forward, we might face clusters without IMEX (channels).
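
For illustration, a minimal sketch of what an optional `channel` could look like at the serialization level. It assumes the pointer field simply gains `omitempty`; the `Name` field on `ComputeDomainChannelSpec` and the `encode` helper are hypothetical stand-ins and are not taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComputeDomainChannelSpec is a stand-in for the real type.
// The Name field is hypothetical and only exists to make the example concrete.
type ComputeDomainChannelSpec struct {
	Name string `json:"name"`
}

// ComputeDomainSpec mirrors the struct under review, with Channel made optional:
// a nil pointer combined with omitempty drops the key from the encoded object.
type ComputeDomainSpec struct {
	NumNodes int                       `json:"numNodes"`
	Channel  *ComputeDomainChannelSpec `json:"channel,omitempty"`
}

// encode marshals a spec to JSON (hypothetical helper for this example).
func encode(spec ComputeDomainSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// No channel configured: the "channel" key is omitted entirely.
	fmt.Println(encode(ComputeDomainSpec{NumNodes: 4}))
	// Channel configured: the nested object is emitted as usual.
	fmt.Println(encode(ComputeDomainSpec{
		NumNodes: 4,
		Channel:  &ComputeDomainChannelSpec{Name: "imex"},
	}))
}
```

Consumers of the spec would then need a nil check before dereferencing `Channel`, which is the main cost of making the field optional.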

@@ -0,0 +1,89 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator

Maybe do what some clothing brands do ("Since 1987"*)? But I am not a lawyer, so maybe the year in the license header has a deeper legal meaning.

@@ -0,0 +1,49 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
Collaborator

Is DRA really as old as 2022?

tail -f /dev/null & wait
fi
/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
tail -n +1 -f /var/log/nvidia-imex.log & wait
Collaborator

Discussed in a sync meeting a while ago: here we give up control of the IMEX daemon process (do we? how does errexit behave when a daemonized process exits non-zero?). In any case, for robustness and debuggability it will be good to actively monitor the health of the IMEX daemon process (polling the process, or better: getting a health signal actively and straight from the process). I'd like to look into that at some point, after merge.

Collaborator Author

This is still a problem. If the daemon crashes, we will not exit (but the liveness probe will eventually fail and the pod will be restarted). We should make this more robust as a follow-up, probably by not doing everything in bash and instead writing a small Go utility.
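
As a rough sketch of the kind of small Go utility mentioned above, the program below polls a daemonized process by PID and exits non-zero once it disappears, so the container would stop instead of waiting for the liveness probe. Everything here is an assumption for illustration: the names `processAlive` and `waitUntilGone`, the one-second interval, and passing the PID on the command line are hypothetical, and how the IMEX daemon exposes its PID is not covered. It is Unix-only.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"syscall"
	"time"
)

// processAlive reports whether a process with the given PID exists.
// Signal 0 runs the existence and permission checks without delivering a signal.
// Note: a zombie that has not been reaped still counts as alive here.
func processAlive(pid int) bool {
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but is owned by another user.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// waitUntilGone blocks, polling at the given interval, until alive(pid) is false.
// The check is injected so it can be replaced in tests.
func waitUntilGone(pid int, interval time.Duration, alive func(int) bool) {
	for alive(pid) {
		time.Sleep(interval)
	}
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: watchpid <pid>")
		os.Exit(2)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintf(os.Stderr, "invalid pid %q: %v\n", os.Args[1], err)
		os.Exit(2)
	}
	waitUntilGone(pid, time.Second, processAlive)
	// Exit non-zero so the container runtime sees the daemon's disappearance as a failure.
	fmt.Fprintf(os.Stderr, "process %d is gone, exiting\n", pid)
	os.Exit(1)
}
```

Polling only detects that the process is gone, not that it is healthy; a health signal obtained directly from the daemon, as suggested earlier in this thread, would be the stronger check.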

@klueska klueska merged commit d1fad7e into NVIDIA:main Feb 20, 2025
4 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025
@klueska klueska deleted the add-multi-node-crd branch August 20, 2025 21:26

4 participants