Commit 21c8988

Merge pull request #282 from almaslennikov/lifecycle-fixes
chore: Update the lifecycle management doc
2 parents aca2dde + 2597c1f commit 21c8988

6 files changed: +175 -128 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -213,7 +213,7 @@ gen-docs: build-cache
 .PHONY: generate-docs-versions-var
 generate-docs-versions-var: | $(BUILDDIR)
 	curl -sL ${RELEASE_YAML_URL} -o $(CURDIR)/build/release.yaml
-	cd hack/release && go run release.go --releaseDefaults $(CURDIR)/build/release.yaml --templateDir ./templates/vars --outputDir ../../docs/common/
+	cd hack/release && go run release.go --releaseDefaults $(CURDIR)/build/release.yaml --releaseVersions $(CURDIR)/hack/release/versions.txt --templateDir ./templates/vars --outputDir ../../docs/common/
 	cd hack/release && go run release.go --with-sha256 --releaseDefaults $(CURDIR)/build/release.yaml --templateDir ./templates/image-sha256 --outputDir ../../docs/advanced/

 .PHONY: release-build

docs/common/vars.rst

Lines changed: 3 additions & 0 deletions
@@ -49,3 +49,6 @@
 .. |k8s-launch-kit-component-version| replace:: v25.10.0-beta.5
 .. |k8s-launch-kit-repository| replace:: nvcr.io/nvstaging/mellanox
 .. |k8s-launch-kit-network-operator-repository| replace:: nvcr.io/nvstaging/mellanox
+.. |current-ga-version| replace:: 25.10.x
+.. |current-maintenance-version| replace:: 25.7.x
+.. |current-eol-version| replace:: 25.4.x
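
The three new substitutions are generated by hack/release/release.go from the fields added to the Release struct in this commit. The template file under hack/release/templates/vars is not part of this diff, so the sketch below only assumes what such a template might look like: `varsTemplate`, the trimmed-down `release` struct, and `renderVars` are hypothetical names for illustration, while the field names come from the diff.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// release is a reduced stand-in for the Release struct in release.go;
// only the three fields added by this commit are modelled here.
type release struct {
	CurrentGAVersionMajorMinor   string
	CurrentMaintenanceMajorMinor string
	CurrentEOLMajorMinor         string
}

// varsTemplate is an assumed template body; the real template is not shown in the diff.
const varsTemplate = `.. |current-ga-version| replace:: {{ .CurrentGAVersionMajorMinor }}
.. |current-maintenance-version| replace:: {{ .CurrentMaintenanceMajorMinor }}
.. |current-eol-version| replace:: {{ .CurrentEOLMajorMinor }}
`

// renderVars executes the template against the given release values
// and returns the rendered reStructuredText substitutions.
func renderVars(r release) (string, error) {
	tmpl, err := template.New("vars").Parse(varsTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, r); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderVars(release{
		CurrentGAVersionMajorMinor:   "25.10.x",
		CurrentMaintenanceMajorMinor: "25.7.x",
		CurrentEOLMajorMinor:         "25.4.x",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

With the values shown, the rendered text matches the three lines added to docs/common/vars.rst, which life-cycle-management.rst then references as `|current-ga-version|`, `|current-maintenance-version|`, and `|current-eol-version|`.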

docs/life-cycle-management.rst

Lines changed: 106 additions & 127 deletions
@@ -53,13 +53,13 @@ The product life cycle and versioning are subject to change in the future.
 * - Network Operator Version
 - Status

-* - 25.1.x
+* - |current-ga-version|
 - Generally Available

-* - 24.10.x
+* - |current-maintenance-version|
 - Maintenance

-* - 24.7.x and lower
+* - |current-eol-version| and lower
 - EOL


@@ -115,21 +115,6 @@ Get the NicClusterPolicy status:
 Network Operator Upgrade
 ========================

-Before upgrading to Network Operator v24.10 or newer with SR-IOV Network Operator enabled, the following manual actions are required:
-
-.. code-block:: bash
-
-$ kubectl -n nvidia-network-operator scale deployment network-operator-sriov-network-operator --replicas 0
-
-$ kubectl -n nvidia-network-operator delete sriovnetworknodepolicies.sriovnetwork.openshift.io default
-
-The network operator provides limited upgrade capabilities, which require additional manual actions if a containerized DOCA-OFED Driver is used. Future releases of the network operator will provide an automatic upgrade flow for the containerized driver.
-
-Since Helm does not support auto-upgrade of existing CRDs, the user must follow a two-step process to upgrade the network-operator release:
-
-* Upgrade the CRD to the latest version
-* Apply Helm chart update
-
 ----------------------------
 Downloading a New Helm Chart
 ----------------------------
@@ -143,129 +128,37 @@ To obtain new releases, run:
 $ helm fetch \https://helm.ngc.nvidia.com/nvidia/charts/network-operator-|helm-chart-version|.tgz
 $ ls network-operator-\*.tgz | xargs -n 1 tar xf

-
--------------------------------------
-Upgrading CRDs for a Specific Release
--------------------------------------
-
-It is possible to retrieve updated CRDs from the Helm chart or from the release branch on GitHub. The example below shows how to upgrade CRDs from the downloaded chart.
-
-.. code-block:: bash
-
-$ kubectl apply \
--f network-operator/crds \
--f network-operator/charts/sriov-network-operator/crds
-
 ---------------------------------------------
-Preparing the Helm Values for the New Release
+Applying the Helm Chart Update
 ---------------------------------------------

-Edit the values-<VERSION>.yaml file as required for your cluster. The network operator has some limitations as to which updates in the NicClusterPolicy it can handle automatically. If the configuration for the new release is different from the current configuration in the deployed release, some additional manual actions may be required.
-
-Known limitations:
-
-* If component configuration was removed from the NicClusterPolicy, manual clean up of the component's resources (DaemonSets, ConfigMaps, etc.) may be required.
-* If the configuration for devicePlugin changed without image upgrade, manual restart of the devicePlugin may be required.
-
-These limitations will be addressed in future releases.
-
-.. warning:: Changes that were made directly in the NicClusterPolicy CR (e.g. with kubectl edit) will be overwritten by the Helm upgrade due to the `force` flag.
-
-------------------------------
-Applying the Helm Chart Update
-------------------------------
+Edit the `values-<VERSION>.yaml` file as required for your cluster.

 To apply the Helm chart update, run:

 .. code-block:: bash

 $ helm upgrade -n nvidia-network-operator network-operator nvidia/network-operator --version=<VERSION> -f values-<VERSION>.yaml --force

-.. warning:: The --devel option is required if you wish to use the Beta release.
+-----------------------------
+Updating the NicClusterPolicy
+-----------------------------

--------------------------------
-DOCA-OFED Driver Manual Upgrade
--------------------------------
+.. note::
+
+Helm upgrade does not update components version in the NicClusterPolicy. It should be done manually after the upgrade is done.

-#####################################################
-Restarting Pods with a Containerized DOCA-OFED Driver
-#####################################################
-
-.. warning:: This operation is required only if containerized DOCA-OFED Driver is in use.
-
-When a containerized DOCA-OFED Driver is reloaded on the node, all pods that use a secondary network based on NVIDIA NICs will lose network interface in their containers. To prevent outage, remove all pods that use a secondary network from the node before you reload the driver pod on it.
-
-The Helm upgrade command will only upgrade the DaemonSet spec of the DOCA-OFED Driver to point to the new driver version. The DOCA-OFED Driver's DaemonSet will not automatically restart pods with the driver on the nodes, as it uses "OnDelete" updateStrategy. The old DOCA-OFED Driver version will still run on the node until you explicitly remove the driver pod or reboot the node:
+.. note::
+
+The network operator has some limitations as to which updates in the NicClusterPolicy it can handle automatically. If the configuration for the new release is different from the current configuration in the deployed release, some additional manual actions may be required.

-.. code-block:: bash
+Known limitations:
+
+* If the configuration for devicePlugin changed without image upgrade, manual restart of the devicePlugin may be required.

-$ kubectl delete pod -l app=mofed-<OS_NAME> -n nvidia-network-operator
+These limitations will be addressed in future releases.

-It is possible to remove all pods with secondary networks from all cluster nodes, and then restart the DOCA-OFED Driver pods on all nodes at once.
-
-The alternative option is to perform an upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver pod restart can be done on each node individually. In this case, pods with secondary networks should be removed from the single node only. There is no need to stop pods on all nodes.
-
-For each node, follow these steps to reload the driver on the node:
-
-1. Remove pods with a secondary network from the node.
-2. Restart the DOCA-OFED Driver pod.
-3. Return the pods with a secondary network to the node.
-
-When the DOCA-OFED Driver is ready, proceed with the same steps for other nodes.
-
-####################################################
-Removing Pods with a Secondary Network from the Node
-####################################################
-
-To remove pods with a secondary network from the node with node drain, run the following command:
-
-.. code-block:: bash
-
-$ kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>
-
-.. warning:: Replace <NODE_NAME> with -l "network.nvidia.com/operator.mofed.wait=false" if you wish to drain all nodes at once.
-
-###################################
-Restarting the DOCA-OFED Driver Pod
-###################################
-
-Find the DOCA-OFED Driver pod name for the node:
-
-.. code-block:: bash
-
-$ kubectl get pod -l app=mofed-<OS_NAME> -o wide -A
-
-Example for Ubuntu 20.04:
-
-.. code-block:: bash
-
-kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A
-
-###############################################
-Deleting the DOCA-OFED Driver Pod from the Node
-###############################################
-
-To delete the DOCA-OFED Driver pod from the node, run:
-
-.. code-block:: bash
-
-$ kubectl delete pod -n <DRIVER_NAMESPACE> <DOCA_DRIVER_POD_NAME>
-
-.. warning:: Replace <DOCA_DRIVER_POD_NAME> with -l app=mofed-ubuntu20.04 if you wish to remove DOCA-OFED Driver pods on all nodes at once.
-
-A new version of the DOCA-OFED Driver pod will automatically start.
-
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Returning Pods with a Secondary Network to the Node
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-After the DOCA-OFED Driver pod is ready on the node, you can make the node schedulable again.
-
-The command below will uncordon (remove node.kubernetes.io/unschedulable:NoSchedule taint) the node, and return the pods to it:
-
-.. code-block:: bash
-
-$ kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false"
+Update the components version in the NicClusterPolicy. Refer to the :ref:`NicClusterPolicy CRD Full Example <ncp-cr-example>` for more details and latest version of the components.

 ----------------------------------
 Automatic DOCA-OFED Driver Upgrade
@@ -318,7 +211,7 @@ To enable automatic DOCA-OFED Driver upgrade, define the UpgradePolicy section f
 # specify if should continue even if there are pods using emptyDir
 deleteEmptyDir: false

-Apply NicClusterPolicy CRD:
+Apply NicClusterPolicy CR:

 .. code-block:: bash

@@ -457,6 +350,92 @@ Troubleshooting
 - Manually delete the pod by using ``kubectl delete -n <Network Operator Namespace> <pod name>``.
 If following the restart the pod still fails, change the NVIDIA DOCA-OFED Driver version in the NicClusterPolicy to the previous version or to another working version.

+-------------------------------
+DOCA-OFED Driver Manual Upgrade
+-------------------------------
+
+Automatic DOCA-OFED Driver upgrade is the preferred method for upgrading the DOCA-OFED Driver. However, if you need to manually upgrade the DOCA-OFED Driver, you can follow the steps below.
+
+#####################################################
+Restarting Pods with a Containerized DOCA-OFED Driver
+#####################################################
+
+.. warning:: This operation is required only if containerized DOCA-OFED Driver is in use.
+
+When a containerized DOCA-OFED Driver is reloaded on the node, all pods that use a secondary network based on NVIDIA NICs will lose network interface in their containers. To prevent outage, remove all pods that use a secondary network from the node before you reload the driver pod on it.
+
+The Helm upgrade command will only upgrade the DaemonSet spec of the DOCA-OFED Driver to point to the new driver version. The DOCA-OFED Driver's DaemonSet will not automatically restart pods with the driver on the nodes, as it uses "OnDelete" updateStrategy. The old DOCA-OFED Driver version will still run on the node until you explicitly remove the driver pod or reboot the node:
+
+.. code-block:: bash
+
+$ kubectl delete pod -l app=mofed-<OS_NAME> -n nvidia-network-operator
+
+It is possible to remove all pods with secondary networks from all cluster nodes, and then restart the DOCA-OFED Driver pods on all nodes at once.
+
+The alternative option is to perform an upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver pod restart can be done on each node individually. In this case, pods with secondary networks should be removed from the single node only. There is no need to stop pods on all nodes.
+
+For each node, follow these steps to reload the driver on the node:
+
+1. Remove pods with a secondary network from the node.
+2. Restart the DOCA-OFED Driver pod.
+3. Return the pods with a secondary network to the node.
+
+When the DOCA-OFED Driver is ready, proceed with the same steps for other nodes.
+
+####################################################
+Removing Pods with a Secondary Network from the Node
+####################################################
+
+To remove pods with a secondary network from the node with node drain, run the following command:
+
+.. code-block:: bash
+
+$ kubectl drain <NODE_NAME> --pod-selector=<SELECTOR_FOR_PODS>
+
+.. warning:: Replace <NODE_NAME> with -l "network.nvidia.com/operator.mofed.wait=false" if you wish to drain all nodes at once.
+
+###################################
+Restarting the DOCA-OFED Driver Pod
+###################################
+
+Find the DOCA-OFED Driver pod name for the node:
+
+.. code-block:: bash
+
+$ kubectl get pod -l app=mofed-<OS_NAME> -o wide -A
+
+Example for Ubuntu 20.04:
+
+.. code-block:: bash
+
+kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A
+
+###############################################
+Deleting the DOCA-OFED Driver Pod from the Node
+###############################################
+
+To delete the DOCA-OFED Driver pod from the node, run:
+
+.. code-block:: bash
+
+$ kubectl delete pod -n <DRIVER_NAMESPACE> <DOCA_DRIVER_POD_NAME>
+
+.. warning:: Replace <DOCA_DRIVER_POD_NAME> with -l app=mofed-ubuntu20.04 if you wish to remove DOCA-OFED Driver pods on all nodes at once.
+
+A new version of the DOCA-OFED Driver pod will automatically start.
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Returning Pods with a Secondary Network to the Node
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After the DOCA-OFED Driver pod is ready on the node, you can make the node schedulable again.
+
+The command below will uncordon (remove node.kubernetes.io/unschedulable:NoSchedule taint) the node, and return the pods to it:
+
+.. code-block:: bash
+
+$ kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false"
+
 --------------------------------------------------------
 Network Operator Upgrade on OpenShift Container Platform
 --------------------------------------------------------

hack/release/release.go

Lines changed: 57 additions & 0 deletions
@@ -24,6 +24,7 @@ import (
 	"os"
 	"path/filepath"
 	"reflect"
+	"regexp"
 	"sort"
 	"strings"
 	"text/template"
@@ -82,6 +83,9 @@ type Release struct {
 	NicConfigurationConfigDaemon *ReleaseImageSpec
 	MaintenanceOperator *ReleaseImageSpec
 	SpectrumXOperator *ReleaseImageSpec
+	CurrentGAVersionMajorMinor string
+	CurrentMaintenanceMajorMinor string
+	CurrentEOLMajorMinor string
 }

 func readDefaults(releaseDefaults string) Release {
@@ -136,6 +140,7 @@ func readEnvironmentVariables(release *Release) {
 func main() {
 	templateDir := flag.String("templateDir", ".", "Directory with templates to render")
 	outputDir := flag.String("outputDir", ".", "Destination directory to render templates to")
+	releaseVersions := flag.String("releaseVersions", "", "File with release versions to use for the templates. Won't be used if not provided")
 	releaseDefaults := flag.String("releaseDefaults", "release.yaml", "Destination of the release defaults definition")
 	retrieveSha := flag.Bool("with-sha256", false, "retrieve SHA256 for container images references")
 	flag.Parse()
@@ -150,6 +155,13 @@ func main() {
 			os.Exit(1)
 		}
 	}
+	if *releaseVersions != "" {
+		err := handleReleaseVersions(&release, *releaseVersions)
+		if err != nil {
+			fmt.Printf("Error: %v\n", err)
+			os.Exit(1)
+		}
+	}
 	var files []string
 	err := filepath.Walk(*templateDir, func(path string, info os.FileInfo, err error) error {
 		// Error during traversal
@@ -376,3 +388,48 @@ func getAuth(repo string) remote.Option {
 		return remote.WithAuthFromKeychain(authn.DefaultKeychain)
 	}
 }
+
+// handleReleaseVersions updates the release versions file with the current major.minor version
+// and fills the CurrentGAVersionMajorMinor, CurrentMaintenanceMajorMinor, and CurrentEOLMajorMinor fields.
+func handleReleaseVersions(release *Release, releaseVersionsPath string) error {
+	content, err := os.ReadFile(filepath.Clean(releaseVersionsPath))
+	if err != nil {
+		return err
+	}
+
+	rawLines := strings.Split(string(content), "\n")
+	lines := make([]string, 0, len(rawLines))
+	for _, l := range rawLines {
+		trimmed := strings.TrimSpace(l)
+		if trimmed != "" {
+			lines = append(lines, trimmed)
+		}
+	}
+
+	// Parse version using regex: supports values like "v25.10.0-beta.5" or "25.10.0"
+	re := regexp.MustCompile(`^v?(\d+)\.(\d+)\.\d+(?:[-+].*)?$`)
+	matches := re.FindStringSubmatch(release.NetworkOperator.Version)
+	if matches == nil {
+		return fmt.Errorf("failed to parse version %q", release.NetworkOperator.Version)
+	}
+	currentMajorMinor := fmt.Sprintf("%s.%s.x", matches[1], matches[2])
+
+	// If top line doesn't match, prepend the extracted major.minor
+	if len(lines) == 0 || lines[0] != currentMajorMinor {
+		lines = append([]string{currentMajorMinor}, lines...)
+		// Write back to the file; ensure trailing newline
+		newContent := strings.Join(lines, "\n") + "\n"
+		if err := os.WriteFile(releaseVersionsPath, []byte(newContent), 0o644); err != nil {
+			return err
+		}
+	}
+
+	// After update, take three top lines
+	if len(lines) < 3 {
+		return fmt.Errorf("versions file %q has fewer than 3 lines after update", releaseVersionsPath)
+	}
+	release.CurrentGAVersionMajorMinor = lines[0]
+	release.CurrentMaintenanceMajorMinor = lines[1]
+	release.CurrentEOLMajorMinor = lines[2]
+	return nil
+}
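
The core of handleReleaseVersions can be exercised without touching the filesystem. The sketch below reuses the regular expression from the diff and separates the two steps: deriving the `major.minor.x` string and prepending it to the version list when it is not already on top. `majorMinorX` and `rotate` are hypothetical helper names, and the starting list `25.7.x, 25.4.x, 25.1.x` is only an assumed example of what hack/release/versions.txt might contain.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRe is the same pattern used in handleReleaseVersions.
var versionRe = regexp.MustCompile(`^v?(\d+)\.(\d+)\.\d+(?:[-+].*)?$`)

// majorMinorX converts a version such as "v25.10.0-beta.5" into "25.10.x".
func majorMinorX(version string) (string, error) {
	m := versionRe.FindStringSubmatch(version)
	if m == nil {
		return "", fmt.Errorf("failed to parse version %q", version)
	}
	return fmt.Sprintf("%s.%s.x", m[1], m[2]), nil
}

// rotate prepends current unless it is already the first entry.
// The boolean reports whether the list changed (that is, whether
// the versions file would need to be rewritten).
func rotate(lines []string, current string) ([]string, bool) {
	if len(lines) > 0 && lines[0] == current {
		return lines, false
	}
	return append([]string{current}, lines...), true
}

func main() {
	current, err := majorMinorX("v25.10.0-beta.5")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Assumed previous file contents, newest first.
	lines, changed := rotate([]string{"25.7.x", "25.4.x", "25.1.x"}, current)
	// The top three entries map to GA, Maintenance, and EOL.
	fmt.Println(changed, lines[:3]) // prints "true [25.10.x 25.7.x 25.4.x]"
}
```

Because only the major and minor components are compared, a patch or pre-release bump within the same minor series leaves the list unchanged, while the first build of a new minor series shifts every status down by one row. Note that in the committed function the file is rewritten before the three-line check runs, so a file that is too short is still updated even though an error is returned.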
