dashboard cluster status page has wrong GPU info #922

@prod-feng

Hi,

OnDemand 4.0.5, Rocky Linux 9.4.

The dashboard cluster status page shows wrong GPU info.

In the file "/opt/ood/ondemand/root/usr/share/gems/3.3/ondemand/4.0.5-1/gems/ood_core-0.27.1/lib/ood_core/job/adapters/slurm.rb", line 113:

gres_length = call("sinfo", "-o %G").lines.map(&:strip).map(&:length).max + 2

The resulting width is too small for some newer GPUs. For example:

$  sinfo -o %G
...
gpu:h200:4(S:1)

The longest entry is 15 characters, so the width becomes 17, and the per-node query gives:

$ sinfo -ahNO,nodehost,gres:17,gresused:17|uniq
...
h200x8-02           gpu:h200:8(S:0-1)gpu:h200:8(IDX:0-

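In the line above, the two GPU fields run together into one word, "gpu:h200:8(S:0-1)gpu:h200:8(IDX:0-". The gres value "gpu:h200:8(S:0-1)" is already 17 characters, so it fills the whole column and leaves no space before the gresused field. Parsing that line then fails.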
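A minimal sketch of the merge, using only the values shown above. `fixed_width_row` is a hypothetical helper that pads or truncates each value to its column width and joins the columns with no separator, which matches the sinfo output pasted in this issue. It also assumes the adapter splits each line on whitespace; that part is not quoted here.

```ruby
# Hypothetical helper: pad or truncate each value to its column width and
# concatenate the columns with no separator, like the sinfo output above.
def fixed_width_row(values, widths)
  values.zip(widths).map { |value, width| value[0, width].ljust(width) }.join
end

host      = "h200x8-02"
gres      = "gpu:h200:8(S:0-1)"   # 17 characters
gres_used = "gpu:h200:8(IDX:0-"   # kept truncated, exactly as shown in the issue

# Width 17 (longest %G entry + 2) versus width 21 (longest %G entry + 6).
row17 = fixed_width_row([host, gres, gres_used], [20, 17, 17])
row21 = fixed_width_row([host, gres, gres_used], [20, 21, 21])

# Assumption: the parser splits on whitespace to recover the three fields.
puts row17.split.length
puts row21.split.length
```

With width 17 the split yields two fields instead of three, because nothing separates gres from gresused. With width 21 all three fields come back.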

Using a larger padding fixes it:

gres_length = call("sinfo", "-o %G").lines.map(&:strip).map(&:length).max + 6
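To make the arithmetic explicit, here is a rough sketch of the width calculation with both paddings. `gres_width` is a hypothetical stand-in for the line quoted above, and `sample` is made-up `sinfo -o %G` output in which only the last line comes from this issue.

```ruby
# Hypothetical helper mirroring the width calculation from line 113.
def gres_width(sinfo_output, padding)
  sinfo_output.lines.map(&:strip).map(&:length).max + padding
end

# Stand-in for `sinfo -o %G` output; only "gpu:h200:4(S:1)" is from the issue.
sample = "GRES\n(null)\ngpu:h200:4(S:1)\n"

# Per-node gres value from the failing line (17 characters).
node_gres = "gpu:h200:8(S:0-1)"

[2, 6].each do |padding|
  width = gres_width(sample, padding)
  # The value needs to be shorter than the column to leave a separating space.
  separated = node_gres.length < width
  puts "padding=#{padding} width=#{width} separated=#{separated}"
end
```

With padding 2 the width is 17, the same as the length of the per-node value, so no space is left. With padding 6 the width is 21 and the fields stay apart.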

Alternatively, the easiest fix is to skip the length calculation entirely and change line 14 to

gres_lines = call("sinfo", "-ahNO ,nodehost,gres,gresused")

Best,

Feng

Status: Reviewed, Scheduled