
Conversation

@wanglei19991004
Contributor

Fixes the issue where monitoring switches do not take effect in multi-machine operation.

@gemini-code-assist
Contributor

Summary of Changes

Hello @wanglei19991004, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the log collection and diagnostic reporting mechanisms, particularly for distributed training environments. It introduces a more robust approach by enabling individual nodes to perform self-monitoring, which resolves previous issues where monitoring configurations were not correctly applied across multiple machines. The changes standardize diagnostic file locations, enhance the configuration of monitoring services with node-specific details, and provide better control over monitoring activation through configuration files, alongside implementing cleanup procedures for disabled monitoring.

Highlights

  • Enhanced Multi-Node Monitoring: Introduced a 'single-node monitoring mode' where each node independently manages its log collection and diagnostic report generation, addressing issues in multi-machine operations where configurations might not have been applied correctly.
  • Standardized Diagnostic File Paths: Diagnostic files are now consistently stored in a dedicated 'monitor' subdirectory within the base log directory, improving organization and reliability across different execution environments.
  • Improved Monitor Service Configuration: The monitor launcher and MonitorService now accept explicit host and node_rank arguments, allowing for precise, node-specific monitoring and diagnostic actions.
  • Configurable Monitoring: The enable_monitoring flag can now be configured via the experiment configuration, providing greater flexibility and centralized control over monitoring behavior.
  • Cleanup for Disabled Monitoring: When monitoring is disabled, the system now includes logic to clean up old monitor-related files for specific hosts and nodes, preventing stale data accumulation. A minimal sketch of this per-node layout follows the list.
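
The PR itself does not show the exact file-naming or cleanup code, so the following is only a minimal sketch of the per-node layout described above; build_diagnostic_path and cleanup_stale_monitor_files are hypothetical names, not taken from the change.

import glob
import os


def build_diagnostic_path(base_log_dir: str, host: str, node_rank: int) -> str:
    # Diagnostic reports live in a dedicated 'monitor' subdirectory of the log dir.
    monitor_dir = os.path.join(base_log_dir, "monitor")
    os.makedirs(monitor_dir, exist_ok=True)
    # One file per host + node_rank pair, so nodes never overwrite each other.
    return os.path.join(monitor_dir, f"diagnostic_{host}_{node_rank}.log")


def cleanup_stale_monitor_files(base_log_dir: str, host: str, node_rank: int) -> None:
    # When monitoring is disabled, drop leftover monitor files for this node only.
    for path in glob.glob(os.path.join(base_log_dir, "monitor", f"*_{host}_{node_rank}*")):
        os.remove(path)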


@gemini-code-assist bot left a comment


Code Review

This pull request aims to fix log collection and diagnostics in multi-machine environments. The changes introduce a single_node_mode for the MonitorService, allowing each node to monitor itself. This is achieved by passing host and node_rank information down from the runner scripts to the monitor service. The changes look mostly correct and improve the monitoring architecture for distributed setups.

I've found a critical issue in CloudTrainRunner where a method is called with a new parameter but its signature hasn't been updated, which will lead to a runtime error. I've also left a couple of medium-severity comments regarding a typo and code duplication that could improve maintainability. Please address the critical issue before merging.
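
For illustration only: the real runner code is not shown in this conversation, and launch_monitor / monitor_launcher.py below are placeholder names. Passing the node identity from the runner down to a per-node monitor process might look roughly like this.

import subprocess
import sys


def launch_monitor(pid_file: str, host: str, node_rank: int) -> subprocess.Popen:
    # Each node starts its own monitor process ("single-node monitoring mode"),
    # so the enable/disable switch and diagnostics are applied on that node itself.
    cmd = [
        sys.executable,
        "monitor_launcher.py",  # placeholder for the actual monitor entry point
        "--pid-file", pid_file,
        "--host", host,
        "--node-rank", str(node_rank),
    ]
    return subprocess.Popen(cmd)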

@CLAassistant

CLAassistant commented Nov 18, 2025

CLA assistant check
All committers have signed the CLA.

@zhaoyinglia changed the title [Elastic] Fix log collection and diagnostic → [Elastic] Fix log collection and diagnostic for multi-node on Nov 19, 2025
"--pid-file", required=True, help="The path of the PID file for the training process"
)
parser.add_argument("--host", required=True, help="Hostname or IP of this node")
parser.add_argument("--node-rank", type=int, required=True, help="Node rank of this node")

These two arguments are both required ... that means the launcher can be used only for a single-node scenario. Am I understanding this correctly?

@wanglei19991004 (Contributor, Author)


These two arguments are used to generate a separate diagnostic file for each host + node pair.
The current implementation actually supports multi-host and multi-node scenarios, not just a single node.
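
A hedged illustration of that point, parsing the two arguments and deriving a per-pair file name (the actual naming scheme is not shown in the diff):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--pid-file", required=True)
parser.add_argument("--host", required=True)
parser.add_argument("--node-rank", type=int, required=True)
args = parser.parse_args()

# Each (host, node_rank) pair yields its own diagnostic file, so invoking the
# launcher once per node also covers multi-host / multi-node jobs.
diagnostic_file = os.path.join("monitor", f"diagnostic_{args.host}_{args.node_rank}.log")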

):
    # Read from config if not explicitly provided
    if enable_monitoring is None:
        enable_monitoring = self.config.experiment.runner.get("enable_monitoring", True)

This is a behavior change ...
Monitoring previously defaulted to False, but now it defaults to True.
Is this alright?

@wanglei19991004 (Contributor, Author)


Sorry! It should default to False; I will fix it accordingly.
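
Given the snippet above, the fix the author describes would presumably just change the fallback value:

# Keep monitoring opt-in: fall back to False when the experiment config does not set it.
enable_monitoring = self.config.experiment.runner.get("enable_monitoring", False)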
