[Elastic] Fix log collection and diagnostic for multi-node #882
Conversation
Summary of Changes

Hello @wanglei19991004, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the log collection and diagnostic reporting mechanisms, particularly for distributed training environments. It introduces a more robust approach by enabling individual nodes to perform self-monitoring, which resolves previous issues where monitoring configurations were not correctly applied across multiple machines. The changes standardize diagnostic file locations, enhance the configuration of monitoring services with node-specific details, and provide better control over monitoring activation through configuration files, alongside implementing cleanup procedures for disabled monitoring.
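To make the cleanup behaviour concrete, here is a minimal sketch, assuming a hypothetical diagnostics directory and file pattern; neither name comes from the PR itself.

```python
import glob
import os


def cleanup_disabled_monitoring(diagnostics_dir: str, enable_monitoring: bool) -> None:
    """Remove stale per-node diagnostic files when monitoring is turned off.

    Both `diagnostics_dir` and the `diagnostic_*.json` pattern are
    illustrative assumptions, not the exact paths used in this PR.
    """
    if enable_monitoring:
        return
    for path in glob.glob(os.path.join(diagnostics_dir, "diagnostic_*.json")):
        try:
            os.remove(path)
        except OSError:
            # The file may already have been removed by an earlier cleanup; ignore.
            pass
```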
Code Review
This pull request aims to fix log collection and diagnostics in multi-machine environments. The changes introduce a single_node_mode for the MonitorService, allowing each node to monitor itself. This is achieved by passing host and node_rank information down from the runner scripts to the monitor service. The changes look mostly correct and improve the monitoring architecture for distributed setups.
I've found a critical issue in CloudTrainRunner where a method is called with a new parameter but its signature hasn't been updated, which will lead to a runtime error. I've also left a couple of medium-severity comments regarding a typo and code duplication that could improve maintainability. Please address the critical issue before merging.
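For readers unfamiliar with the layout, a rough sketch of how a runner could hand the node identity down to a per-node monitor launcher follows. The flag spellings mirror the diff below; the script name, function names, and the ssh-less local launch are illustrative assumptions, not the repository's actual interface.

```python
import subprocess
from typing import List


def build_monitor_command(pid_file: str, host: str, node_rank: int) -> List[str]:
    # The flags mirror the launcher arguments shown in the diff below; the
    # script name "monitor_launcher.py" is a placeholder.
    return [
        "python",
        "monitor_launcher.py",
        "--pid-file", pid_file,
        "--host", host,
        "--node-rank", str(node_rank),
    ]


def launch_monitors(hosts: List[str], pid_file: str) -> List[subprocess.Popen]:
    # One monitor per node, so each node watches itself instead of relying on
    # a single central monitor whose settings never reached the other machines.
    # In a real multi-host run each command would be executed on its own host
    # (e.g. via ssh); launching locally keeps this sketch short.
    return [
        subprocess.Popen(build_monitor_command(pid_file, host, rank))
        for rank, host in enumerate(hosts)
    ]
```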
| "--pid-file", required=True, help="The path of the PID file for the training process" | ||
| ) | ||
| parser.add_argument("--host", required=True, help="Hostname or IP of this node") | ||
| parser.add_argument("--node-rank", type=int, required=True, help="Node rank of this node") |
These two arguments are both required ... that means the launcher can only be used in a single-node scenario. Am I understanding this correctly?
These two arguments are used to generate a separate diagnostic file for each host + node pair.
The current implementation actually supports multi-host and multi-node scenarios, not just a single node.
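As a small illustration of that explanation, one way to derive a distinct diagnostic file per host/node pair might look like the sketch below; the directory layout and naming scheme are assumptions, not the PR's actual output format.

```python
import os


def diagnostic_path(output_dir: str, host: str, node_rank: int) -> str:
    # One file per (host, node_rank) pair, so concurrent nodes in a
    # multi-host run never overwrite each other's diagnostics.
    return os.path.join(output_dir, f"diagnostic_{host}_rank{node_rank}.json")


# Example:
#   diagnostic_path("outputs/diagnostics", "10.0.0.2", 1)
#   -> "outputs/diagnostics/diagnostic_10.0.0.2_rank1.json"
```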
flagscale/runner/runner_train.py
Outdated
):
    # Read from config if not explicitly provided
    if enable_monitoring is None:
        enable_monitoring = self.config.experiment.runner.get("enable_monitoring", True)
This is a behavior change ...
Monitoring previously defaulted to False; now it defaults to True.
Is this alright?
Sorry! It should default to False, and I will fix it accordingly.
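Assuming the fix simply flips the fallback value passed to `get`, the corrected lookup would look roughly like the sketch below; the config keys match the diff above, while the wrapper function is purely illustrative.

```python
def resolve_enable_monitoring(config, enable_monitoring=None) -> bool:
    """Resolve the monitoring switch, keeping monitoring opt-in by default."""
    if enable_monitoring is None:
        # Same lookup as in the diff above, but with the fallback flipped
        # back to False so behaviour matches the previous default.
        return config.experiment.runner.get("enable_monitoring", False)
    return enable_monitoring
```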
Fixed the issue where the monitoring switch did not take effect in multi-machine runs.