@vymao vymao commented Aug 26, 2025

When the Metaflow metadata service provider is not in use, Metaflow generates task IDs locally. These IDs are plain integers, based on how many tasks/steps there are, and are incremented sequentially by new_task_id in metaflow/plugins/metadata_providers/local.py. This is a problem for AWS Batch MNP (multi-node parallel) jobs, because the secondary command is currently built by replacing every occurrence of the task ID string. When the task ID is a plain integer, that replacement also hits many unrelated places.

For example, if the task ID is "3", every "3" in the secondary command gets "-node-$AWS_BATCH_JOB_NODE_INDEX" appended, when only the actual task ID should change.
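A minimal sketch of the failure mode, using a simplified, hypothetical command string (the real secondary command is longer and its exact layout is not shown in this PR):

```python
# Simplified, hypothetical command; only the shape matters here.
task_id = "3"
suffix = "-node-$AWS_BATCH_JOB_NODE_INDEX"
cmd = "python flow.py --run-id 13 step train --task-id 3 --retry-count 3"

# Blanket substitution: every "3" is treated as if it were the task ID.
naive = cmd.replace(task_id, task_id + suffix)

print(naive)
print("replacements made:", naive.count(suffix))
```

Only the value after `--task-id` should have changed, but the run ID and the retry count are rewritten as well.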

I've identified two places where the actual task ID appears and needs replacing: the --task-id argument and the task ID in MF_PATHSPEC. As far as I can tell, these are the only two places in the command, so this PR uses more specific regexes that target just them.
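Roughly, the targeted replacement could look like the sketch below. It is not the exact code in this PR: the function name `suffix_task_id`, the regexes, and the assumption that MF_PATHSPEC appears as `MF_PATHSPEC=<flow>/<run>/<step>/<task>` in the command string are all illustrative.

```python
import re

SUFFIX = "-node-$AWS_BATCH_JOB_NODE_INDEX"


def suffix_task_id(cmd: str, task_id: str) -> str:
    """Rewrite the task ID only after --task-id and at the end of MF_PATHSPEC."""
    tid = re.escape(task_id)
    # Mirrors the existing replacement: drop "control-" and append the node suffix.
    new_id = task_id.replace("control-", "") + SUFFIX

    # (?![\w-]) keeps "3" from matching inside "31" or "3-foo".
    patterns = [
        r"(--task-id[ =])" + tid + r"(?![\w-])",       # --task-id 3 / --task-id=3
        r"(MF_PATHSPEC=\S*/)" + tid + r"(?![\w-])",    # last pathspec component
    ]
    for pattern in patterns:
        # A function replacement avoids any escaping issues in new_id.
        cmd = re.sub(pattern, lambda m: m.group(1) + new_id, cmd)
    return cmd


cmd = (
    "export MF_PATHSPEC=MyFlow/13/train/3 && "
    "python flow.py --run-id 13 step train --task-id 3 --retry-count 3"
)
print(suffix_task_id(cmd, "3"))
```

With this input, the run ID `13` and the retry count `3` are left alone; only the two task ID occurrences get the node suffix.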

self._task_id.replace("control-", "")
+ "-node-$AWS_BATCH_JOB_NODE_INDEX",
)
# Fix: Only replace task ID in specific arguments, not in environment variables
Collaborator

Thanks for the fix! @saikonen, any suggestions on approaches that would let us avoid leaning on string substitution (it's quite finicky)?

Collaborator

Haven't touched this implementation in a bit, so maybe there is a reason the task-id patterns are kept in execute(), but at first glance this seems (even in the original implementation) like quite a late phase to be doing anything with the command.

Could a better place be where the cmd is not yet joined into a string and we can operate on the options separately? E.g. batch_cli.py#251 and step_functions.py#925?
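To illustrate the idea of working on separate options rather than on the joined string, here is a small sketch. The helper `with_task_id` and the argument list are hypothetical and do not reflect how batch_cli.py or step_functions.py actually build the command.

```python
from typing import List


def with_task_id(args: List[str], new_task_id: str) -> List[str]:
    """Return a copy of args with only the --task-id value replaced."""
    out = list(args)
    for i, arg in enumerate(out):
        if arg == "--task-id" and i + 1 < len(out):
            out[i + 1] = new_task_id          # "--task-id 3" form
        elif arg.startswith("--task-id="):
            out[i] = "--task-id=" + new_task_id  # "--task-id=3" form
    return out


args = ["step", "train", "--run-id", "13", "--task-id", "3", "--retry-count", "3"]
# The node index is kept as unexpanded shell text here.
print(" ".join(with_task_id(args, "3-node-$AWS_BATCH_JOB_NODE_INDEX")))
```

Because each option is a separate list element, no pattern matching on the value is needed, and other arguments that happen to equal the task ID are untouched.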

Author

@vymao vymao Aug 27, 2025

I'm not 100% sure of the flow, but I think that is possible if the --task-id flag is the only thing that needs to change. Looking at the command, I'm not sure whether something else (e.g. MF_PATHSPEC) also requires a modified task ID. If --task-id is the only thing, then I can definitely make that change.

Author

I've updated the code. I think it is difficult to rely fully on batch_cli.py#251, because I believe $AWS_BATCH_JOB_NODE_INDEX is only available at runtime on a given MNP worker node. But I've added more placeholders in batch_cli.py#251 that should make the regex more reliable. I'm less familiar with the step_functions file, as I'm not using it right now.
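As a rough illustration of the placeholder idea (not the code in this PR), the command can be built with a distinctive marker wherever the task ID belongs, and the marker resolved later once it is known whether the command is for the control node or a worker node. The token `__MF_TASK_ID__`, the functions `build_command` and `render`, and the command layout are all invented for this sketch.

```python
TASK_ID_TOKEN = "__MF_TASK_ID__"  # hypothetical marker, unlikely to collide
SUFFIX = "-node-$AWS_BATCH_JOB_NODE_INDEX"


def build_command(run_id: str, step: str, flow: str = "MyFlow") -> str:
    """Build a command template with the marker in place of the task ID."""
    return (
        f"export MF_PATHSPEC={flow}/{run_id}/{step}/{TASK_ID_TOKEN} && "
        f"python flow.py --run-id {run_id} step {step} "
        f"--task-id {TASK_ID_TOKEN} --retry-count 3"
    )


def render(template: str, task_id: str, worker: bool = False) -> str:
    """Fill in the marker; workers get the node suffix, left for the shell to expand."""
    value = (task_id.replace("control-", "") + SUFFIX) if worker else task_id
    return template.replace(TASK_ID_TOKEN, value)


template = build_command("13", "train")
print(render(template, "3"))               # control node command
print(render(template, "3", worker=True))  # worker node command
```

Since the marker never appears anywhere else in the command, a plain string replacement is safe here, and $AWS_BATCH_JOB_NODE_INDEX is still only expanded at runtime on the worker.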

Is it possible to get approval on this soon?


@savingoyal savingoyal requested a review from saikonen August 26, 2025 21:43
@vymao vymao requested a review from savingoyal August 28, 2025 20:23