Troubleshooting SSH Connection Failures in JARVICE Jobs on Kubernetes #17
Description
When I try to start an Ubuntu desktop job (or any other job, such as RStudio or python3) on JARVICE, I get the following errors in the job output of the jarvice-mc interface:
INIT[1]: Configuring user: nimbix nimbix 505...
INIT[1]: Initializing networking...
INIT[1]: WARNING: Cross Memory Attach not available for MPI applications
INIT[1]: Platform fabric and MPI libraries successfully deployed
INIT[1]: Detected preferred MPI fabric provider: tcp
INIT[1]: Reading keys...
INIT[1]: Finalizing setup in application environment...
INIT[1]: Waiting for job configuration before executing application...
INIT[1]: hostname: jarvice-job-2-67tfx
INIT[1]: Injecting static ssh client.
INIT[1]: Starting SSHD server...
INIT[1]: Checking all nodes can be reached through ssh...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused
INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...
INIT[1]: Timeout (60s).
Failed to establish ssh connectivity to all nodes.
Canceling job.
The pod stays in the Running state for a few minutes before it is terminated because of the SSH connection failure. I can get a shell inside it in interactive mode until I am kicked out with exit code 137. This happens with every image I use (RStudio, Python, and my own Alpine image from my local registry via "Push to Compute"). I am running a JARVICE demo installation on a bare-metal Kubernetes cluster.
I understand that the problem originates in my job pod, since the nodes cannot reach the pod over SSH. Is it possible to skip the "start sshd" step for a job? I simply want to run an Alpine container in the background, without any output in the browser.
If skipping SSH is not possible, does anyone have an idea of what might be causing this issue?
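To help narrow this down: "Connection refused" usually means the hostname resolved and the host answered, but nothing was accepting connections on port 22 at that moment. That points more toward sshd not listening inside the pod than toward DNS or routing. Below is a minimal sketch of a probe that separates these cases. It is not part of JARVICE; `probe_tcp` is a hypothetical helper name, and it assumes a Python 3 interpreter is available in the image so it can be run from the interactive shell.

```python
import errno
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Try one TCP connection to host:port and classify the outcome."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"            # something is listening and accepted
    except socket.gaierror:
        return "dns-failure"         # the hostname did not resolve
    except ConnectionRefusedError:
        return "refused"             # host reachable, no listener on the port
    except socket.timeout:
        return "timeout"             # no answer at all (filtered or unroutable)
    except OSError as exc:
        # Any other socket error, reported by its symbolic errno name.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

For example, `probe_tcp("jarvice-job-2-67tfx")` run inside the job pod would be expected to return `"refused"` if the situation matches the log above, and `"open"` once sshd is actually listening.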