
ComputeDomain: support more than one domain per node, with subsets of GPUs #353

@jgehrcke

Currently, there can be at most one ComputeDomain (CD) per node, regardless of the size of the workload in that domain. If the workload does not consume all GPUs of a node, the remaining GPUs on that node cannot be part of any other CD. The ability to spawn more than one CD per node would therefore improve resource utilization, as indicated in this figure:

Image
The CD shown above in yellow cannot be spawned today once the green and blue ones are in place. Building out support for that is the goal tracked here.

Supporting more than one CD per node requires significant implementation changes. Currently, there is a one-to-one mapping between CDs and IMEX daemons, and only one IMEX daemon can run per node (a technical constraint that, at least for now, we take for granted). Supporting more than one CD per node therefore in all likelihood requires us to build a one-to-many relationship between IMEX daemon and ComputeDomain. That, in turn, might have an impact on security isolation:

  • one-to-one mapping (current): each CD uses its own IMEX daemon ensemble (one IMEX domain, one IMEX channel).
  • one-to-many mapping (future): different CDs would be run in the same IMEX daemon ensemble, share an IMEX domain, and be isolated "only" by using different IMEX channels.
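
To make the one-to-many idea concrete, here is a minimal Go sketch of the bookkeeping it implies on a single node: one shared IMEX daemon, several CDs that each own a disjoint subset of the node's GPUs, and a distinct IMEX channel per CD. Everything here is hypothetical and for illustration only: `Node`, `NewNode`, `Attach`, `FreeGPUs`, and the sequential channel numbering are assumptions, not part of the driver's actual API or of how IMEX assigns channels.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical model of one machine: a fixed set of GPUs and
// a single IMEX daemon that is shared by all CDs placed on the node.
type Node struct {
	Name     string
	gpuOwner []string       // gpuOwner[i] is the CD holding GPU i, "" if free
	channels map[string]int // CD name -> IMEX channel ID (assumed numbering)
	nextChan int
}

// NewNode returns a node with gpuCount free GPUs and no CDs attached.
func NewNode(name string, gpuCount int) *Node {
	return &Node{
		Name:     name,
		gpuOwner: make([]string, gpuCount),
		channels: map[string]int{},
	}
}

// Attach places a CD on the node with the given GPU subset and gives it
// its own channel. All checks run before any state changes, so a failed
// Attach leaves the node untouched.
func (n *Node) Attach(cd string, gpus []int) (int, error) {
	if len(gpus) == 0 {
		return 0, errors.New("a CD needs at least one GPU")
	}
	if _, ok := n.channels[cd]; ok {
		return 0, fmt.Errorf("%s is already attached to %s", cd, n.Name)
	}
	seen := map[int]bool{}
	for _, g := range gpus {
		if g < 0 || g >= len(n.gpuOwner) {
			return 0, fmt.Errorf("GPU %d does not exist on %s", g, n.Name)
		}
		if seen[g] {
			return 0, fmt.Errorf("GPU %d requested twice", g)
		}
		if owner := n.gpuOwner[g]; owner != "" {
			return 0, fmt.Errorf("GPU %d already belongs to %s", g, owner)
		}
		seen[g] = true
	}
	for _, g := range gpus {
		n.gpuOwner[g] = cd
	}
	ch := n.nextChan
	n.nextChan++
	n.channels[cd] = ch
	return ch, nil
}

// FreeGPUs lists the GPU indices not yet claimed by any CD.
func (n *Node) FreeGPUs() []int {
	free := []int{}
	for i, owner := range n.gpuOwner {
		if owner == "" {
			free = append(free, i)
		}
	}
	return free
}

func main() {
	node := NewNode("node-a", 4)
	for _, req := range []struct {
		cd   string
		gpus []int
	}{
		{"cd-green", []int{0, 1}},
		{"cd-yellow", []int{2}}, // second CD on the same node
		{"cd-blue", []int{1}},   // rejected: GPU 1 is taken
	} {
		ch, err := node.Attach(req.cd, req.gpus)
		if err != nil {
			fmt.Println("attach failed:", err)
			continue
		}
		fmt.Printf("%s -> channel %d\n", req.cd, ch)
	}
	fmt.Println("free GPUs:", node.FreeGPUs())
}
```

In this model the only thing separating two CDs on the same node is the channel ID and the GPU ownership table, which is exactly the isolation question raised in the second bullet.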

Labels: feature (issue/PR that proposes a new feature or functionality)

Status: Backlog