Scheduling multiple tasks that consume multiple GPUs in a single job fails even if GPUs are available

### Nomad version
`Nomad v1.9.13+ent`

### Issue
I have a node with 4 GPUs which are correctly fingerprinted by Nomad, but certain combinations of tasks and device counts are rejected with "missing devices" even though the total number of requested GPU instances never exceeds the number of fingerprinted GPUs (4).

I **can** run a job where a single task requests a device `nvidia/gpu` count of 4.

I **cannot** run a job with two tasks where the device `nvidia/gpu` count per task is 2. 
The plan is rejected with "missing devices".

I **can** run a job with two tasks where the device `nvidia/gpu` count is 2 in the first task and the device `nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb` count is 2 in the second task.

Node info:

$ `nomad node status -verbose <node_id>`

```
[ ... truncated ...]
Device Resource Utilization
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-076513e0-6b48-5e13-ab80-b8be7c9259a6]  <none>
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-3e8b3066-e6de-5b42-b70a-c349e5eed28f]  <none>
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-6715152b-4058-5804-bf60-7652876231e2]  <none>
nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb[MIG-f2187ec1-76f0-5f4d-8caf-e3bc17901c1a]  <none>
[ ... truncated ...]
```

### Reproduction steps

This job **fails** with "missing devices":

```
job "gpu-test-1" {
  namespace = "foo"
  node_pool = "foo"
  type      = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "bar"
  }

  group "test" {
    count = 1

    task "task-one" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 2
        }
      }
    }

    task "task-two" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 2
        }
      }
    }
  }
}
```

```
$ nomad plan 1.hcl
+ Job: "gpu-test-1"
+ Task Group: "test" (1 create)
  + Task: "task-one" (forces create)
  + Task: "task-two" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "test" (failed to place 1 allocation):
    * Class "bronze": 2 nodes excluded by filter
    * Constraint "${attr.unique.hostname} = bar": 1 nodes excluded by filter
    * Constraint "missing devices": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 1.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```

This job, however, **succeeds**:

```
job "gpu-test-2" {
  namespace = "foo"
  node_pool = "foo"
  type      = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "bar"
  }

  group "test" {
    count = 1

    task "task-one" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 4
        }
      }
    }

    task "task-two" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024
      }
    }
  }
}
```

```
$ nomad plan 2.hcl
+ Job: "gpu-test-2"
+ Task Group: "test" (1 create)
  + Task: "task-one" (forces create)
  + Task: "task-two" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 2.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```

Now - it gets even stranger ...

If I target two `nvidia/gpu` devices in one task and two `nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb` devices in the second task, the job also **succeeds** ...

```
job "gpu-test-3" {
  namespace = "foo"
  node_pool = "foo"
  type      = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "bar"
  }

  group "test" {
    count = 1

    task "task-one" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu" {
          count = 2
        }
      }
    }

    task "task-two" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb" {
          count = 2
        }
      }
    }
  }
}
```

```
$ nomad plan 3.hcl
+ Job: "gpu-test-3"
+ Task Group: "test" (1 create)
  + Task: "task-one" (forces create)
  + Task: "task-two" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 3.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
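
One more variant might help isolate whether the rejection is tied to two tasks both using the short `nvidia/gpu` name: both tasks request the fully qualified device name instead. This is only a sketch. The job name `gpu-test-4` is hypothetical, the job is not part of the reproduction above, and no plan result for it is reported here.

```hcl
# Hypothetical diagnostic variant (assumption: not run as part of this report).
# Both tasks use the fully qualified device name instead of "nvidia/gpu".
job "gpu-test-4" {
  namespace = "foo"
  node_pool = "foo"
  type      = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "bar"
  }

  group "test" {
    count = 1

    task "task-one" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        # Same device name and count as task-two in gpu-test-3
        device "nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb" {
          count = 2
        }
      }
    }

    task "task-two" {
      driver = "docker"

      config {
        image = "ubuntu:latest"
        args  = ["sleep", "infinity"]
      }

      resources {
        cpu    = 500
        memory = 1024

        device "nvidia/gpu/NVIDIA H100 NVL MIG 3g.47gb" {
          count = 2
        }
      }
    }
  }
}
```

If this plan were also rejected, that would suggest the problem occurs whenever two tasks in the same group use an identical device name, rather than being specific to the short `nvidia/gpu` form.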


#### Expected Result
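Nomad should allow the job operator to split the fingerprinted devices across multiple tasks in the same task group, as long as the total requested count does not exceed the number of available devices.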

#### Actual Result
Nomad rejects certain jobs with a "missing devices" constraint failure.


