Description
Feature Description
Add support for configurable shared memory (shm_size) parameter in AzureMLOrchestratorSettings. The underlying AzureML SDK v2 (JobResourceConfiguration) already supports this parameter, but ZenML's integration does not expose it to users.
Problem or Use Case
Problem:
When running ML pipelines on AzureML through ZenML, containers default to around 64MB of shared memory (/dev/shm). This is insufficient for:
- PyTorch dataloaders with multi-worker loading
- Large model caching in RAM
- In-memory data processing for large-scale ML workflows
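To confirm that the shared memory mount is the bottleneck, its size can be inspected from inside the container. A minimal sketch (the `/dev/shm` path assumes a Linux container, as used by AzureML jobs):

```python
import os

def mount_size_bytes(path: str = "/dev/shm") -> int:
    """Return the total size of the filesystem mounted at `path`, in bytes."""
    stats = os.statvfs(path)
    return stats.f_frsize * stats.f_blocks

# Inside a default container this typically reports ~64MB for /dev/shm,
# regardless of how much RAM the compute instance has.
```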
Use Case:
Users working with large models (e.g., diffusion models, LLMs) need to cache model weights and intermediate tensors in shared memory. Without configurable shm_size, these pipelines fail with out-of-memory errors in shared memory, even when the compute instance has sufficient RAM.
Current Limitation:
The AzureML SDK v2's JobResourceConfiguration class supports shm_size, but ZenML's AzureMLOrchestratorSettings does not expose this parameter, requiring users to manually patch the ZenML source code.
Proposed Solution
Add a shm_size field to the AzureMLOrchestratorSettings class and pass it through to the underlying AzureML JobResourceConfiguration.
Implementation Changes:
1. Update `AzureMLOrchestratorSettings`:

File: `zenml/integrations/azure/flavors/azureml_orchestrator_flavor.py`

```python
class AzureMLOrchestratorSettings(AzureMLComputeSettings):
    """Settings for the AzureML orchestrator."""

    synchronous: bool = Field(
        default=True,
        description="Whether the orchestrator runs synchronously or not.",
    )
    # Add shm_size field
    shm_size: Optional[str] = Field(
        default=None,
        description="Size of the shared memory block (e.g. '2g', '200g').",
    )
```

2. Update `AzureMLOrchestrator`:
File: zenml/integrations/azure/orchestrators/azureml_orchestrator.py
Add the import at the top of the file:

```python
from azure.ai.ml.entities import (
    CommandComponent,
    JobResourceConfiguration,  # Add this import
    # ... other imports
)
```

Update the `_create_command_component` method signature and implementation:
```python
@staticmethod
def _create_command_component(
    step: Step,
    image: str,
    command: List[str],
    environment_variables: Dict[str, str],
    environment: str,
    shm_size: Optional[str] = None,  # Add this parameter
) -> CommandComponent:
    """Create an AzureML CommandComponent for a pipeline step.

    Args:
        step: The ZenML step to create a component for.
        image: Docker image to use.
        command: Command to execute.
        environment_variables: Environment variables.
        environment: Environment name.
        shm_size: Optional shared memory size (e.g. '2g', '200g').

    Returns:
        CommandComponent configured for the step.
    """
    # Create a resource configuration only if shm_size is specified
    resources = None
    if shm_size:
        resources = JobResourceConfiguration(shm_size=shm_size)

    return CommandComponent(
        name=step.config.name,
        display_name=step.config.name,
        description=step.config.docstring,
        command=" ".join(command),
        environment=environment,
        environment_variables=environment_variables,
        resources=resources,  # Pass the resources configuration
    )
```

Update the `submit_pipeline` method to pass `shm_size`:
```python
def submit_pipeline(
    self,
    deployment: "PipelineDeploymentResponse",
    stack: "Stack",
) -> None:
    """Submit a pipeline to AzureML."""
    # ... existing code ...
    for step_name, step in deployment.step_configurations.items():
        # ... existing code ...
        components[step_name] = self._create_command_component(
            step=step,
            image=docker_image_name,
            command=command,
            environment_variables=env_vars,
            environment=environment_name,
            shm_size=settings.shm_size,  # Pass shm_size from settings
        )
```

User Experience:
```python
from zenml.integrations.azure.flavors import AzureMLOrchestratorSettings

# Configure the orchestrator with a custom shared memory size
settings = AzureMLOrchestratorSettings(
    mode="compute-cluster",
    compute_name="gpu-cluster",
    shm_size="200g",  # Set shared memory to 200GB
)

# Use in pipeline execution
pipeline.with_options(settings={"orchestrator": settings})()
```

Proposed Documentation:
**shm_size** (`Optional[str]`): Size of the shared memory block for the container (e.g., `"2g"`, `"200g"`).
Useful when working with large models, PyTorch dataloaders with multiple workers, or RAM-based caching.
Defaults to the AzureML container default (typically 64MB).

Alternatives Considered
1. Manual Patching (Current Workaround):
   - Manually edit ZenML source files in `site-packages`
   - Drawbacks:
     - Not maintainable across ZenML updates
     - Requires modifying installed packages
     - Not portable across environments
2. Using the Docker `--shm-size` flag:
   - This approach doesn't apply to AzureML's managed job execution
   - AzureML's container configuration requires using the SDK's `JobResourceConfiguration`
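For completeness, the runtime-patching workaround from option 1 can be sketched generically. The classes below are simplified stand-ins, not the real ZenML API; the sketch only illustrates the shape (and fragility) of wrapping a method to inject a default:

```python
import functools
from typing import Optional

class Orchestrator:
    """Stand-in for the real orchestrator class (NOT the ZenML API)."""

    def create_component(self, name: str, shm_size: Optional[str] = None) -> dict:
        return {"name": name, "shm_size": shm_size}

def patch_default_shm_size(cls: type, default: str) -> None:
    """Monkey-patch create_component so calls without shm_size get a default."""
    original = cls.create_component

    @functools.wraps(original)
    def patched(self, name, shm_size=None):
        return original(self, name, shm_size=shm_size or default)

    cls.create_component = patched

patch_default_shm_size(Orchestrator, "2g")
```

Any such patch silently breaks whenever the wrapped method's signature changes in a new release, which is exactly why exposing the parameter in the settings class is preferable.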
Additional Context
Code Implementation Reference
The manual patch has been successfully tested and requires changes to two files:
File 1: `zenml/integrations/azure/flavors/azureml_orchestrator_flavor.py`
- Add a `shm_size: Optional[str]` field to `AzureMLOrchestratorSettings`

File 2: `zenml/integrations/azure/orchestrators/azureml_orchestrator.py`
- Import: `from azure.ai.ml.entities import JobResourceConfiguration`
- Modify: the `_create_command_component` method to accept and use `shm_size`
- Modify: the `submit_pipeline` method to pass `settings.shm_size` to component creation

Technical Details
- The AzureML SDK v2 already supports this via `JobResourceConfiguration.shm_size`
- Format: a string with size notation (e.g., `"2g"`, `"200g"`, `"1024m"`)
- Default behavior: when `None`, AzureML uses its default (typically 64MB)
- No breaking changes: the parameter is optional with `default=None`
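A small helper could sanity-check the size notation before it reaches AzureML. This is an illustrative sketch, not part of ZenML or the AzureML SDK; it assumes the Docker-style `<number><unit>` format with `b`/`k`/`m`/`g` units:

```python
import re

_SIZE_RE = re.compile(r"^(\d+)([bkmg])$", re.IGNORECASE)
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_shm_size(value: str) -> int:
    """Parse a size string like '2g' or '1024m' into a byte count."""
    match = _SIZE_RE.match(value)
    if not match:
        raise ValueError(
            f"Invalid shm_size: {value!r} (expected e.g. '2g', '1024m')"
        )
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]
```

Such a check could be attached as a validator on the settings field so that malformed values fail fast at configuration time rather than at job submission.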
Note
The Code of Conduct link in this feature request template appears to be broken (shows "Not found" error).
Priority
High - Critical for my use case
Code of Conduct
- I agree to follow this project's Code of Conduct