
Conversation

@guptaNswati
Contributor

Need openssh-server to fix this error

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/usr/sbin/sshd": stat /usr/sbin/sshd: no such file or directory: unknown
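
For context, a minimal sketch of the image-build step that would put /usr/sbin/sshd in place; the openssh-server package comes from the comment above, but the exact layout of this PR's Dockerfile change is an assumption:

# Hypothetical build step (e.g. a RUN instruction in the CUDA-based worker image),
# not the PR's exact diff: install the OpenSSH server so the entrypoint can exec /usr/sbin/sshd.
apt-get update && \
    apt-get install -y --no-install-recommends openssh-server && \
    rm -rf /var/lib/apt/lists/*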

Signed-off-by: Swati Gupta <[email protected]>
@copy-pr-bot

copy-pr-bot bot commented Mar 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@guptaNswati
Contributor Author

/ok to test

@elezar
Member

elezar commented Mar 19, 2025

@guptaNswati why does this sample need to be connected to over SSH? When does the error you mention above present itself?

@elezar
Member

elezar commented Mar 19, 2025

OK. This is required because we switched to the CUDA base image, correct?

@guptaNswati
Contributor Author

OK. This is required because we switched to the CUDA base image, correct?

Yes
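
For background, the SSH requirement comes from the MPI launch path: the launcher starts the remote nvbandwidth ranks over SSH, so every worker has to run /usr/sbin/sshd. A hypothetical launcher-side invocation (flags, port, and hostfile path are assumptions, not taken from the PR):

# Hypothetical: Open MPI's rsh launcher connects to each worker's sshd (here on port 2222)
# to spawn the remote ranks.
$ mpirun --mca plm_rsh_agent "ssh -p 2222" -np 8 --hostfile /etc/mpi/hostfile \
    nvbandwidth -t multinode_device_to_device_memcpy_read_ce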

@guptaNswati
Contributor Author

/ok to test

@guptaNswati
Contributor Author

/ok to test

@guptaNswati force-pushed the fix-nvbandwidth branch 2 times, most recently from d5148ec to b4e8b8b (March 21, 2025, 00:26)
@guptaNswati
Contributor Author

/ok to test

@guptaNswati
Contributor Author

/ok to test

Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati
Contributor Author

/ok to test

@guptaNswati
Contributor Author

Test passed with the new changes:

$ kubectl get pod
NAME                        READY   STATUS         RESTARTS   AGE
nvbandwidth-test-worker-0   1/1     Running        0          27m
nvbandwidth-test-worker-1   1/1     Running        0          8s
nvbandwidth-test-worker-2   1/1     Running        0          27m
nvbandwidth-test-worker-3   1/1     Running        0          27m

$ kubectl logs nvbandwidth-test-worker-2 
Server listening on 0.0.0.0 port 2222.
Server listening on :: port 2222.

$ kubectl get pod
NAME                              READY   STATUS      RESTARTS   AGE
nvbandwidth-test-launcher-mjpkn   0/1     Completed   0          68s

nvbandwidth Version: v0.7
Built from Git version: v0.7

MPI version: Open MPI v4.1.2, package: Debian OpenMPI, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021
CUDA Runtime Version: 12080
CUDA Driver Version: 12080
Driver Version: 570.86.15

Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GH200 96GB HBM3 (00000009:01:00)
Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GH200 96GB HBM3 (00000019:01:00)
Process 2 (nvbandwidth-test-worker-1): device 0: NVIDIA GH200 96GB HBM3 (00000009:01:00)
Process 3 (nvbandwidth-test-worker-1): device 1: NVIDIA GH200 96GB HBM3 (00000019:01:00)
Process 4 (nvbandwidth-test-worker-2): device 0: NVIDIA GH200 96GB HBM3 (00000009:01:00)
Process 5 (nvbandwidth-test-worker-2): device 1: NVIDIA GH200 96GB HBM3 (00000019:01:00)
Process 6 (nvbandwidth-test-worker-3): device 0: NVIDIA GH200 96GB HBM3 (00000009:01:00)
Process 7 (nvbandwidth-test-worker-3): device 1: NVIDIA GH200 96GB HBM3 (00000019:01:00)

Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0       N/A    392.04    391.66    391.91    391.82    391.95    391.95    391.93
 1    391.97       N/A    391.72    391.93    391.98    391.88    391.91    391.88
 2    391.88    391.79       N/A    391.89    391.97    391.91    391.93    391.93
 3    391.97    391.81    392.02       N/A    391.86    391.98    391.97    392.00
 4    391.91    391.89    391.82    391.77       N/A    391.91    392.00    391.91
 5    391.91    391.81    391.93    391.81    391.79       N/A    391.97    392.00
 6    391.97    391.91    392.02    391.84    391.89    391.77       N/A    391.98
 7    391.93    391.91    392.02    391.79    391.82    391.73    391.91       N/A

SUM multinode_device_to_device_memcpy_read_ce 21946.36

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
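
A hypothetical spot check for reviewers that the fix landed in the worker image and sshd is serving (commands assumed, not part of the pasted output above):

# Hypothetical checks; pod name taken from the output above.
$ kubectl exec nvbandwidth-test-worker-2 -- ls -l /usr/sbin/sshd
$ kubectl logs nvbandwidth-test-worker-2 | grep "port 2222"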

@guptaNswati merged commit 6b2cf85 into NVIDIA:main on Mar 21, 2025
9 checks passed