
Scheduling multiple tasks that consume multiple GPUs in a single job fails even if GPUs are available #27102

@henrikjohansen

Nomad version

Nomad v1.9.13+ent

Issue

I have a node with 4 GPUs that Nomad fingerprints correctly, but certain combinations of tasks and device counts are rejected with "missing devices" even though the total number of requested GPU instances never exceeds the number of fingerprinted GPUs (4).

I can run a job where a single task requests a device nvidia/gpu count of 4.

I cannot run a job with two tasks where the device nvidia/gpu count per task is 2.
The plan is rejected with "missing devices".

I can run a job with two tasks where device nvidia/gpu count is 2 in the first task and device nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb count of 2 in the second task.
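
The three cases above can be tallied in a short Python sketch. It only restates the per-task device counts from the job files in the reproduction steps and compares each job's total with the 4 fingerprinted instances. It does not model how the Nomad scheduler assigns devices, and the names `AVAILABLE`, `jobs` and `total_requested` are illustrative.

```python
# Illustrative tally only; this does not model Nomad's scheduler.
AVAILABLE = 4  # MIG instances fingerprinted on the node

# Per-task device counts copied from the three job files in this report.
jobs = {
    "gpu-test-1": {"task-one": 2, "task-two": 2},  # plan rejected
    "gpu-test-2": {"task-one": 4, "task-two": 0},  # plan succeeds
    "gpu-test-3": {"task-one": 2, "task-two": 2},  # plan succeeds (task-two names the MIG profile)
}


def total_requested(tasks):
    """Sum the device counts requested by all tasks in one group."""
    return sum(tasks.values())


for name, tasks in jobs.items():
    total = total_requested(tasks)
    print(f"{name}: requested {total} of {AVAILABLE}, fits={total <= AVAILABLE}")
```

Every job requests exactly 4 instances in total, so the rejection of gpu-test-1 cannot be explained by the request exceeding the node's capacity.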

Node info :

$ nomad node status -verbose <node_id>

[ ... truncated ...]
Device Resource Utilization
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-076513e0-6b48-5e13-ab80-b8be7c9259a6]  <none>
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-3e8b3066-e6de-5b42-b70a-c349e5eed28f]  <none>
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-6715152b-4058-5804-bf60-7652876231e2]  <none>
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-f2187ec1-76f0-5f4d-8caf-e3bc17901c1a]  <none>
[ ... truncated ...]

Reproduction steps

This job fails with "missing devices":

job "gpu-test-1" {
  namespace = "foo"
  node_pool = "foo"
  type      = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "bar"
  }

  group "test" {
    count = 1

    task "task-one" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 2
        }
      }
    }

    task "task-two" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 2
        }
      }
    }
  }
}

$ nomad plan 1.hcl
+ Job: "gpu-test-1"
+ Task Group: "test" (1 create)
  + Task: "task-one" (forces create)
  + Task: "task-two" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "test" (failed to place 1 allocation):
    * Class "bronze": 2 nodes excluded by filter
    * Constraint "${attr.unique.hostname} = bar": 1 nodes excluded by filter
    * Constraint "missing devices": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 1.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

This job however succeeds :

job "gpu-test-2" {
  namespace = "foo"
  node_pool = "foo"
  type      = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "bar"
  }

  group "test" {
    count = 1

    task "task-one" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 4
        }
      }
    }

    task "task-two" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024
      }
    }
  }
}

$ nomad plan 2.hcl
+ Job: "gpu-test-2"
+ Task Group: "test" (1 create)
  + Task: "task-one" (forces create)
  + Task: "task-two" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 2.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Now - it gets even stranger ...

If I target two nvidia/gpu devices in the first task and two nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb devices in the second task, the job also succeeds ...

job "gpu-test-3" {
  namespace = "foo"
  node_pool = "foo"
  type      = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "bar"
  }

  group "test" {
    count = 1

    task "task-one" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 2
        }
      }
    }

    task "task-two" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb" {
          count = 2
        }
      }
    }
  }
}

$ nomad plan 3.hcl
+ Job: "gpu-test-3"
+ Task Group: "test" (1 create)
  + Task: "task-one" (forces create)
  + Task: "task-two" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 3.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Expected Result

Nomad should allow the job operator to spread the fingerprinted devices across multiple tasks in a task group, as long as the total requested count does not exceed the number of available devices.

Actual Result

Nomad rejects certain jobs with a "missing devices" constraint.
