This repository was archived by the owner on May 14, 2025. It is now read-only.

Troubleshooting SSH Connection Failures in JARVICE Jobs on Kubernetes #17

@CallisteH

When I try to start an Ubuntu-desktop job on JARVICE (or any other job, such as Rstudio or python3), the job output in the jarvice-mc interface shows the following errors:

INIT[1]: Configuring user: nimbix nimbix 505...
INIT[1]: Initializing networking...
INIT[1]: WARNING: Cross Memory Attach not available for MPI applications
INIT[1]: Platform fabric and MPI libraries successfully deployed
INIT[1]: Detected preferred MPI fabric provider: tcp
INIT[1]: Reading keys...
INIT[1]: Finalizing setup in application environment...
INIT[1]: Waiting for job configuration before executing application...
INIT[1]: hostname: jarvice-job-2-67tfx
INIT[1]: Injecting static ssh client.
INIT[1]: Starting SSHD server...
INIT[1]: Checking all nodes can be reached through ssh...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...ssh: connect to host jarvice-job-2-67tfx port 22: Connection refused


INIT[1]: Failed to connect to jarvice-job-2-67tfx.
Sleeping 1s and retry...
INIT[1]: Timeout (60s).
Failed to establish ssh connectivity to all nodes.
Canceling job.

The pod stays in the Running state for a few minutes before it is terminated because of the SSH connection failure. I can open a shell inside it in interactive mode until I am kicked out with exit code 137. This issue occurs with every image I use (Rstudio, Python, and my own Alpine image from my local registry via "Push to Compute"). I am running a JARVICE demo installation on a bare-metal Kubernetes cluster.
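For context on exit code 137: by the usual shell convention, a status above 128 means the process was ended by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). That would fit the interactive session being killed when the job is canceled, though I have not confirmed that this is what happens here. A minimal sketch of the decoding follows; `describe_exit_code` is a hypothetical helper name, not part of JARVICE or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit status (hypothetical helper for illustration).

    Statuses above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

On Linux, signal 9 is SIGKILL, which a process cannot catch or ignore.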

I understand that the problem originates from my job pod, as the nodes cannot connect to the pod via SSH. Is it possible to skip the "Starting SSHD server" step for a job? I simply want to run an Alpine container in the background, without any output in the browser.

If skipping SSH is not possible, does anyone have an idea of what might be causing this issue?
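To help narrow this down, the connectivity check from the log can be approximated at the TCP level from a shell inside the pod. The sketch below is not the JARVICE init script. It only assumes what the log shows: port 22, a 1s pause between retries, and a 60s overall timeout. `wait_for_port` and its parameters are hypothetical names. A "Connection refused" result means nothing is listening on that port at the resolved address, which would suggest sshd never started or is bound elsewhere.

```python
import socket
import time


def wait_for_port(host, port=22, timeout_s=60.0, interval_s=1.0, connect_timeout_s=2.0):
    """Retry a TCP connect until it succeeds or timeout_s elapses.

    Returns (reachable, attempts). The defaults mirror the values seen in
    the job log (1s sleep, 60s timeout); they are assumptions, not
    documented JARVICE settings.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            # create_connection resolves the name and opens a TCP connection.
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return True, attempts
        except OSError:
            # Covers connection refused, timeouts, and name resolution errors.
            pass
        if time.monotonic() + interval_s > deadline:
            return False, attempts
        time.sleep(interval_s)


if __name__ == "__main__":
    # Hostname taken from the log above; replace with the current job's hostname.
    ok, tries = wait_for_port("jarvice-job-2-67tfx", port=22, timeout_s=10.0)
    print(f"reachable={ok} after {tries} attempt(s)")
```

If this probe also fails against the pod's own hostname, the next thing worth checking is whether an sshd process is actually running and listening inside the container.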
