[Elastic] Fix log collection and diagnostic for multi-node #882
Conversation
Summary of Changes

Hello @wanglei19991004, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the log collection and diagnostic reporting mechanisms, particularly for distributed training environments. It introduces a more robust approach by enabling individual nodes to perform self-monitoring, which resolves previous issues where monitoring configurations were not correctly applied across multiple machines. The changes standardize diagnostic file locations, enhance the configuration of monitoring services with node-specific details, and provide better control over monitoring activation through configuration files, alongside implementing cleanup procedures for disabled monitoring.
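To make the cleanup behaviour concrete, here is a minimal sketch, assuming a hypothetical diagnostics directory and file pattern; neither name comes from the PR itself.

```python
import glob
import os


def cleanup_disabled_monitoring(diagnostics_dir: str, enable_monitoring: bool) -> None:
    """Remove stale per-node diagnostic files when monitoring is turned off.

    Both `diagnostics_dir` and the `diagnostic_*.json` pattern are
    illustrative assumptions, not the exact paths used in this PR.
    """
    if enable_monitoring:
        return
    for path in glob.glob(os.path.join(diagnostics_dir, "diagnostic_*.json")):
        try:
            os.remove(path)
        except OSError:
            # The file may already have been removed by an earlier cleanup; ignore.
            pass
```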
Code Review
This pull request aims to fix log collection and diagnostics in multi-machine environments. The changes introduce a single_node_mode for the MonitorService, allowing each node to monitor itself. This is achieved by passing host and node_rank information down from the runner scripts to the monitor service. The changes look mostly correct and improve the monitoring architecture for distributed setups.
I've found a critical issue in CloudTrainRunner where a method is called with a new parameter but its signature hasn't been updated, which will lead to a runtime error. I've also left a couple of medium-severity comments regarding a typo and code duplication that could improve maintainability. Please address the critical issue before merging.
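For readers unfamiliar with the layout, a rough sketch of how a runner could hand the node identity down to a per-node monitor launcher follows. The flag spellings mirror the diff below; the script name, function names, and the ssh-less local launch are illustrative assumptions, not the repository's actual interface.

```python
import subprocess
from typing import List


def build_monitor_command(pid_file: str, host: str, node_rank: int) -> List[str]:
    # The flags mirror the launcher arguments shown in the diff below; the
    # script name "monitor_launcher.py" is a placeholder.
    return [
        "python",
        "monitor_launcher.py",
        "--pid-file", pid_file,
        "--host", host,
        "--node-rank", str(node_rank),
    ]


def launch_monitors(hosts: List[str], pid_file: str) -> List[subprocess.Popen]:
    # One monitor per node, so each node watches itself instead of relying on
    # a single central monitor whose settings never reached the other machines.
    # In a real multi-host run each command would be executed on its own host
    # (e.g. via ssh); launching locally keeps this sketch short.
    return [
        subprocess.Popen(build_monitor_command(pid_file, host, rank))
        for rank, host in enumerate(hosts)
    ]
```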
| "--pid-file", required=True, help="The path of the PID file for the training process" | ||
| ) | ||
| parser.add_argument("--host", required=True, help="Hostname or IP of this node") | ||
| parser.add_argument("--node-rank", type=int, required=True, help="Node rank of this node") |
These two arguments are both required ... that means the launcher can only be used in a single-node scenario. Am I understanding this correctly?
These two arguments are used to generate a separate diagnostic file for each host + node pair.
The current implementation actually supports multi-host and multi-node scenarios, not just a single node.
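As a small illustration of that explanation, one way to derive a distinct diagnostic file per host/node pair might look like the sketch below; the directory layout and naming scheme are assumptions, not the PR's actual output format.

```python
import os


def diagnostic_path(output_dir: str, host: str, node_rank: int) -> str:
    # One file per (host, node_rank) pair, so concurrent nodes in a
    # multi-host run never overwrite each other's diagnostics.
    return os.path.join(output_dir, f"diagnostic_{host}_rank{node_rank}.json")


# Example:
#   diagnostic_path("outputs/diagnostics", "10.0.0.2", 1)
#   -> "outputs/diagnostics/diagnostic_10.0.0.2_rank1.json"
```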
flagscale/runner/runner_train.py
Outdated
):
    # Read from config if not explicitly provided
    if enable_monitoring is None:
        enable_monitoring = self.config.experiment.runner.get("enable_monitoring", True)
This is a behavior change ...
Monitoring previously defaulted to False; now it defaults to True.
Is this alright?
Sorry! It should default to False, and I will fix it accordingly.
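Assuming the fix simply flips the fallback value passed to `get`, the corrected lookup would look roughly like the sketch below; the config keys match the diff above, while the wrapper function is purely illustrative.

```python
def resolve_enable_monitoring(config, enable_monitoring=None) -> bool:
    """Resolve the monitoring switch, keeping monitoring opt-in by default."""
    if enable_monitoring is None:
        # Same lookup as in the diff above, but with the fallback flipped
        # back to False so behaviour matches the previous default.
        return config.experiment.runner.get("enable_monitoring", False)
    return enable_monitoring
```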
Fixed the issue where the monitoring switch did not take effect in multi-machine runs.