Add shm_size support to AzureML Orchestrator #4329

@dani-diffusedrive

Description

Contact Details [Optional]

No response

Feature Description

Add support for configurable shared memory (shm_size) parameter in AzureMLOrchestratorSettings. The underlying AzureML SDK v2 (JobResourceConfiguration) already supports this parameter, but ZenML's integration does not expose it to users.
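For reference, the parameter is already usable when talking to the AzureML SDK v2 directly. A minimal sketch (the environment and compute names are placeholders):

from azure.ai.ml import command
from azure.ai.ml.entities import JobResourceConfiguration

# Build a plain AzureML v2 command job (names below are placeholders).
job = command(
    command="python train.py",
    environment="my-training-env@latest",
    compute="gpu-cluster",
)

# JobResourceConfiguration already exposes shm_size at the SDK level;
# ZenML only needs to forward the value.
job.resources = JobResourceConfiguration(shm_size="16g")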

Problem or Use Case

Problem:
When running ML pipelines on AzureML through ZenML, containers default to around 64MB of shared memory (/dev/shm). This is insufficient for:

  • PyTorch dataloaders with multi-worker loading (see the sketch after this list)
  • Large model caching in RAM
  • In-memory data processing for large-scale ML workflows
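
To make the first point concrete, a hypothetical sketch: with num_workers > 0, PyTorch hands batches between worker processes through /dev/shm, so even a modest image batch overflows the default limit:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset of ImageNet-sized float tensors.
dataset = TensorDataset(torch.randn(2_000, 3, 224, 224))

# With num_workers > 0, batches move between processes via shared memory.
# A single 256 x 3 x 224 x 224 float32 batch is ~150MB, well past the
# default 64MB /dev/shm, which typically crashes the workers.
loader = DataLoader(dataset, batch_size=256, num_workers=4)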

Use Case:
Users working with large models (e.g., diffusion models, LLMs) need to cache model weights and intermediate tensors in shared memory. Without a configurable shm_size, these pipelines fail with shared-memory out-of-memory errors even when the compute instance has plenty of RAM.

Current Limitation:
The AzureML SDK v2's JobResourceConfiguration class supports shm_size, but ZenML's AzureMLOrchestratorSettings does not expose this parameter, requiring users to manually patch the ZenML source code.

Proposed Solution

Add a shm_size field to the AzureMLOrchestratorSettings class and pass it through to the underlying AzureML JobResourceConfiguration.

Implementation Changes:

1. Update AzureMLOrchestratorSettings:

File: zenml/integrations/azure/flavors/azureml_orchestrator_flavor.py

# Imports needed for the new field (typically already present in the module):
from typing import Optional

from pydantic import Field


class AzureMLOrchestratorSettings(AzureMLComputeSettings):
    """Settings for the AzureML orchestrator."""

    synchronous: bool = Field(
        default=True,
        description="Whether the orchestrator runs synchronously or not.",
    )

    # Add shm_size field
    shm_size: Optional[str] = Field(
        default=None,
        description="Size of the shared memory block (e.g. '2g', '200g').",
    )

2. Update AzureMLOrchestrator:

File: zenml/integrations/azure/orchestrators/azureml_orchestrator.py

Add import at the top of the file:

from azure.ai.ml.entities import (
    CommandComponent,
    JobResourceConfiguration,  # Add this import
    # ... other imports
)

Update the _create_command_component method signature and implementation:

@staticmethod
def _create_command_component(
    step: Step,
    image: str,
    command: List[str],
    environment_variables: Dict[str, str],
    environment: str,
    shm_size: Optional[str] = None,  # Add this parameter
) -> CommandComponent:
    """Create an AzureML CommandComponent for a pipeline step.
    
    Args:
        step: The ZenML step to create a component for.
        image: Docker image to use.
        command: Command to execute.
        environment_variables: Environment variables.
        environment: Environment name.
        shm_size: Optional shared memory size (e.g. '2g', '200g').
    
    Returns:
        CommandComponent configured for the step.
    """
    # Create resource configuration if shm_size is specified
    resources = None
    if shm_size:
        resources = JobResourceConfiguration(shm_size=shm_size)
    
    return CommandComponent(
        name=step.config.name,
        display_name=step.config.name,
        description=step.config.docstring,
        command=" ".join(command),
        environment=environment,
        environment_variables=environment_variables,
        resources=resources,  # Pass the resources configuration
    )

Update the submit_pipeline method to pass shm_size:

def submit_pipeline(
    self,
    deployment: "PipelineDeploymentResponse",
    stack: "Stack",
) -> None:
    """Submit a pipeline to AzureML."""
    # ... existing code ...
    
    
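    # `settings` below is assumed to be the resolved
    # AzureMLOrchestratorSettings for this deployment (in ZenML,
    # e.g. via self.get_settings(deployment)).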
    for step_name, step in deployment.step_configurations.items():

        # ... existing code ...

        components[step_name] = self._create_command_component(
            step=step,
            image=docker_image_name,
            command=command,
            environment_variables=env_vars,
            environment=environment_name,
            shm_size=settings.shm_size,  # Pass shm_size from settings
        )

User Experience:

from zenml.integrations.azure.flavors import AzureMLOrchestratorSettings

# Configure orchestrator with custom shared memory size
settings = AzureMLOrchestratorSettings(
    mode="compute-cluster",
    compute_name="gpu-cluster",
    shm_size="200g"  # Set shared memory to 200GB
)

# Use in pipeline execution
pipeline.with_options(settings={"orchestrator": settings})()
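
The same settings can equivalently be attached when the pipeline is defined (a sketch; my_pipeline is a placeholder):

from zenml import pipeline

@pipeline(settings={"orchestrator": AzureMLOrchestratorSettings(
    mode="compute-cluster",
    compute_name="gpu-cluster",
    shm_size="200g",
)})
def my_pipeline():
    ...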

Proposed Documentation:

**shm_size** (`Optional[str]`): Size of the shared memory block for the container (e.g., `"2g"`, `"200g"`).
Useful when working with large models, PyTorch dataloaders with multiple workers, or RAM-based caching.
When unset, the AzureML default applies (typically 64MB).

Alternatives Considered

1. Manual Patching (Current Workaround):

  • Manually edit ZenML source files in site-packages
  • Drawbacks:
    • Not maintainable across ZenML updates
    • Requires modifying installed packages
    • Not portable across environments

2. Using Docker --shm-size flag:

  • This approach doesn't apply to AzureML's managed job execution
  • AzureML's container configuration requires using the SDK's JobResourceConfiguration

Additional Context

Code Implementation Reference

The manual patch has been successfully tested and requires changes to two files:

File 1: zenml/integrations/azure/flavors/azureml_orchestrator_flavor.py

  • Add shm_size: Optional[str] field to AzureMLOrchestratorSettings

File 2: zenml/integrations/azure/orchestrators/azureml_orchestrator.py

  • Import: from azure.ai.ml.entities import JobResourceConfiguration
  • Modify: _create_command_component method to accept and use shm_size
  • Modify: submit_pipeline method to pass settings.shm_size to component creation

Technical Details

  • The AzureML SDK v2 already supports this via JobResourceConfiguration.shm_size
  • Format: String with size notation (e.g., "2g", "200g", "1024m")
  • Default behavior: When None, AzureML uses default (typically 64MB)
  • No breaking changes: The parameter is optional with default=None
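
Optionally, the new field could validate the size notation up front so misconfigurations fail at definition time rather than at job submission. A standalone sketch (not part of the tested patch; the pattern only covers the plain number-plus-suffix forms shown above):

import re
from typing import Optional

from pydantic import BaseModel, field_validator

# Accepts e.g. "2g", "200g", "1024m", "512"; the b/k/m/g suffixes follow
# Docker-style size notation. Illustrative, not exhaustive.
_SHM_SIZE_PATTERN = re.compile(r"^\d+(\.\d+)?[bkmg]?$", re.IGNORECASE)

# Standalone stand-in; the real field would live on AzureMLOrchestratorSettings.
class ShmSettingsSketch(BaseModel):
    shm_size: Optional[str] = None

    @field_validator("shm_size")
    @classmethod
    def _validate_shm_size(cls, value: Optional[str]) -> Optional[str]:
        if value is not None and not _SHM_SIZE_PATTERN.match(value):
            raise ValueError(
                f"Invalid shm_size {value!r}; expected e.g. '2g', '200g', '1024m'."
            )
        return value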

Note

The Code of Conduct link in this feature request template appears to be broken (shows "Not found" error).

Priority

High - Critical for my use case

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • core-team: Issues that are being handled by the core team
  • planned: Planned for the short term
  • snack
  • x-squad: Issues that are being handled by the x-squad

Status

Released
