-
Python version (
-
@ZiyueXu77 can you help check this and see if the timeout is due to the model size?
-
I tested and can confirm the issue. @YuanTingHsieh, I think this is related to the quantization issue we are investigating regarding the external process launcher. I can consistently observe the following before the system errors out: I replaced the external launcher with the in-process executor and the problem is gone on my side. @oded-byte, could you also try a client config like the one below to see if it works?
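The config attached to this reply is not reproduced above. As a rough sketch of what an in-process Client API executor entry in config_fed_client.json can look like (the class path, the argument names, and the script path src/train_script.py are assumptions based on typical NVFlare Client API job templates, not taken from this thread):

```json
{
  "tasks": ["train"],
  "executor": {
    "path": "nvflare.app_opt.pt.in_process_client_api_executor.PTInProcessClientAPIExecutor",
    "args": {
      "task_script_path": "src/train_script.py",
      "task_script_args": ""
    }
  }
}
```

With an in-process executor the training script runs inside the client process, so the external launcher and pipe components are no longer involved; verify the exact class path and arguments against the NVFlare version you are running.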
-
Hi,
-
Hi @oded-byte, we found the root cause: it is these two timeouts, peer_read_timeout (line 36) and heartbeat_timeout (line 40) in https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_opt/pt/client_api_launcher_executor.py#L36-L40. Setting both to 300 makes the issue go away on my side. You can test and adjust the values on your machine, since each machine's speed is different; as you noticed, a faster machine can work with the defaults. Thanks for noticing and raising this! We will update our APIs accordingly and figure out a good way to have these timeouts set properly.
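For reference, a minimal sketch of the corresponding executor entry in config_fed_client.json with both timeouts raised to 300 seconds; apart from peer_read_timeout and heartbeat_timeout, the class path and the other arguments (launcher_id, pipe_id) are assumptions based on typical Client API launcher job templates and may differ in your job:

```json
{
  "tasks": ["train"],
  "executor": {
    "path": "nvflare.app_opt.pt.client_api_launcher_executor.PTClientAPILauncherExecutor",
    "args": {
      "launcher_id": "launcher",
      "pipe_id": "pipe",
      "peer_read_timeout": 300,
      "heartbeat_timeout": 300
    }
  }
}
```

The launcher and pipe components referenced by launcher_id and pipe_id stay as they are in the existing config; only the two timeout values change relative to the defaults.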
-
@oded-byte we increased the default timeouts for the main branch in #3671, something like: