
Commit 40cff16

[2.4] Add reliable xgboost documentation (#2471)

* Add reliable xgboost documentation
* Add missing doc

Co-authored-by: Chester Chen <[email protected]>
1 parent f016cb4 commit 40cff16

File tree

10 files changed (+208, -20 lines)


docs/resources/fed_xgb_detail.png (91.6 KB)

docs/resources/loose_xgb.png (123 KB)

docs/resources/tight_xgb.png (98.3 KB)

docs/user_guide.rst

Lines changed: 1 addition & 0 deletions

@@ -23,3 +23,4 @@ please refer to the :ref:`programming_guide`.
    user_guide/helm_chart
    user_guide/confidential_computing
    user_guide/hierarchy_unification_bridge
+   user_guide/federated_xgboost
Lines changed: 26 additions & 0 deletions

##############################
Federated XGBoost with NVFlare
##############################

XGBoost (https://github.com/dmlc/xgboost) is an open-source project that
implements machine learning algorithms under the Gradient Boosting framework.
It is an optimized distributed gradient boosting library designed to be highly
efficient, flexible, and portable.
This implementation uses MPI (Message Passing Interface) for client
communication and synchronization.

MPI requires the underlying communication network to be perfect: a single
dropped message causes the training to fail.

This is usually achieved via a highly reliable special-purpose interconnect,
such as those used by NCCL.

The open-source XGBoost supports a federated paradigm, in which clients at
different locations communicate with each other via gRPC over internet
connections.

We introduce federated XGBoost with NVFlare for a more reliable federated setup.

.. toctree::
   :maxdepth: 1

   federated_xgboost/implementation
   federated_xgboost/timeout
Lines changed: 65 additions & 0 deletions

#################################
Reliable Federated XGBoost Design
#################################

*************************
Flare as XGBoost Launcher
*************************

NVFLARE serves as a launchpad to start the XGBoost system.
Once started, the XGBoost system runs independently of FLARE,
as illustrated in the following figure.

.. figure:: ../../resources/loose_xgb.png
    :height: 500px

There are a few potential problems with this approach:

- MPI requires a perfect communication network, whereas simple gRPC
  over the internet can be unstable.

- For each job, the XGBoost server must open a port for clients to connect to.
  In real-world deployments, this adds the burden of requesting an additional
  open port from IT. Even if a fixed port is allowed and reused, multiple
  XGBoost jobs cannot run at the same time, since each XGBoost job requires
  a different port number.

*****************************
Flare as XGBoost Communicator
*****************************

FLARE provides a highly flexible, scalable, and reliable communication
mechanism. We enhance the reliability of federated XGBoost by using FLARE
as the communicator of XGBoost, as shown here:

.. figure:: ../../resources/tight_xgb.png
    :height: 500px

Detailed Design
===============

The open-source federated XGBoost (C++) uses gRPC as the communication protocol.
To use FLARE as the communicator, we simply route XGBoost's gRPC messages
through FLARE. To do so, we change the server endpoint of each XGBoost client
to a local gRPC server (LGS) within the FLARE client.

.. figure:: ../../resources/fed_xgb_detail.png
    :height: 500px

As shown in this diagram, there is a local gRPC server (LGS) for each site
that serves as the server endpoint for the XGBoost client on that site.
Similarly, there is a local gRPC client (LGC) on the FL server that
interacts with the XGBoost server. The message path between the XGBoost client
and the XGBoost server is as follows:

1. The XGBoost client generates a gRPC message and sends it to the LGS in the FLARE client.
2. The FLARE client forwards the message to the FLARE server. This is a reliable FLARE message.
3. The FLARE server uses the LGC to send the message to the XGBoost server.
4. The XGBoost server sends the response back to the LGC in the FLARE server.
5. The FLARE server sends the response back to the FLARE client.
6. The FLARE client sends the response back to the XGBoost client via the LGS.

Please note that the XGBoost client (C++) component can run as a separate
process or within the same process as the FLARE client.
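The six-hop round trip above can be sketched as a chain of plain functions. This is a minimal illustration of the routing only; all function names here are hypothetical, not actual NVFlare or XGBoost APIs.

```python
# Hypothetical sketch of the 6-hop message path described above.
# None of these names are real NVFlare or XGBoost identifiers.

def xgb_server_handle(request: bytes) -> bytes:
    # Stand-in for the XGBoost server processing an allreduce request.
    return b"response-to-" + request

def lgc_send_to_xgb_server(message: bytes) -> bytes:
    # Steps 3 and 4: the local gRPC client (LGC) on the FL server
    # talks to the XGBoost server and receives its response.
    return xgb_server_handle(message)

def flare_server_relay(message: bytes) -> bytes:
    # Steps 3-5: the FLARE server forwards via the LGC and relays
    # the response back toward the FLARE client.
    return lgc_send_to_xgb_server(message)

def flare_client_relay(message: bytes) -> bytes:
    # Steps 2 and 6: the FLARE client sends a reliable FLARE message
    # to the FLARE server and returns the response to the LGS.
    return flare_server_relay(message)

def lgs_handle(message: bytes) -> bytes:
    # Steps 1 and 6: the local gRPC server (LGS) receives the XGBoost
    # client's gRPC message and eventually returns the response.
    return flare_client_relay(message)

# Step 1: the XGBoost client sends its gRPC message to the LGS.
reply = lgs_handle(b"allreduce-chunk")
```

The key design point this illustrates: the XGBoost client and server are unchanged; only the endpoint each one talks to is swapped for a FLARE-local proxy.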
Lines changed: 93 additions & 0 deletions

############################################
Reliable Federated XGBoost Timeout Mechanism
############################################

NVFlare introduces a tightly-coupled integration between XGBoost and NVFlare.
NVFlare implements the ReliableMessage mechanism to make XGBoost's server/client
interactions more robust over unstable internet connections.

An unstable internet connection is one where the connections between the
communication endpoints experience random disconnects/reconnects and
fluctuating speed. It does not mean an extended internet outage.

ReliableMessage does not mean guaranteed delivery; it means a best effort to
deliver the message to the peer. If one attempt fails, it keeps retrying until
either the message is successfully delivered or a specified "transaction
timeout" is reached.

*****************
Timeout Mechanism
*****************

At runtime, the FLARE system is configured with a few important timeout
parameters.

ReliableMessage Timeout
=======================

Two timeout values control the behavior of ReliableMessage (RM).

Per-message Timeout
-------------------

Essentially, RM retries sending the message until it is delivered successfully.
Each send attempt is subject to a timeout value. This value should be chosen
based on the message size, the overall network speed, and the time needed to
process the message in a normal situation. For example, if an XGBoost message
takes no more than 5 seconds to be sent, processed, and replied to, the
per-message timeout should be set to 5 seconds.

.. note::

   The initial XGBoost message might take more than 100 seconds,
   depending on the dataset size.

Transaction Timeout
-------------------

This value defines how long RM keeps retrying in case of an unstable
connection. It should be chosen based on the overall stability and nature of
the connection, and how quickly the connection is typically restored. For
occasional connection glitches, this value does not have to be large
(e.g. 20 seconds). However, if outages can last longer (say 60 seconds or
more), this value should be large enough to cover them.

.. note::

   Even if you think the connection is restored (e.g. you replugged the
   internet cable or reactivated WiFi), the underlying connection layer
   may take much longer to actually restore connections (e.g. up to a
   few minutes)!

.. note::

   If the transaction timeout is <= the per-message timeout, the message
   is sent through simple messaging: no retry is done in case of failure.
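The interplay of the two RM timeouts can be sketched as a retry loop. This is an illustrative sketch of the semantics described above, not the actual NVFlare implementation; the function name and signature are hypothetical.

```python
import time

def reliable_send(send_once, per_msg_timeout: float, tx_timeout: float,
                  now=time.monotonic):
    """Illustrative sketch of ReliableMessage retry semantics.

    send_once(timeout) attempts one delivery and returns the reply,
    or raises TimeoutError on failure. If tx_timeout <= per_msg_timeout,
    RM degenerates to simple messaging: one attempt, no retry.
    """
    if tx_timeout <= per_msg_timeout:
        return send_once(per_msg_timeout)  # simple messaging, no retry
    deadline = now() + tx_timeout
    while True:
        try:
            return send_once(per_msg_timeout)
        except TimeoutError:
            if now() >= deadline:
                raise  # transaction timeout reached, give up
```

For example, with a per-message timeout of 5 seconds and a transaction timeout of 500 seconds, a send that fails twice and succeeds on the third attempt still completes the transaction successfully.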
XGBoost Client Operation Timeout
================================

To prevent an XGBoost client from running forever, the XGBoost/FLARE
integration lets you define a parameter (max_client_op_interval) on the
server side to control the maximum amount of time a client is permitted to be
silent (i.e. send no messages to the server). The default value of this
parameter is 900 seconds, meaning that if no XGBoost message is received from
a client for over 900 seconds, that client is considered dead, and the whole
job is aborted.
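The liveness rule above amounts to a simple comparison, sketched below. Only the parameter name ``max_client_op_interval`` and the 900-second default come from the text; the helper function itself is a hypothetical illustration.

```python
DEFAULT_MAX_CLIENT_OP_INTERVAL = 900.0  # seconds (default from the text)

def client_is_dead(last_msg_time: float, now: float,
                   max_client_op_interval: float = DEFAULT_MAX_CLIENT_OP_INTERVAL) -> bool:
    # A client silent for longer than the allowed interval is considered
    # dead; the whole job would then be aborted by the server.
    return (now - last_msg_time) > max_client_op_interval
```

For example, a client last heard from 901 seconds ago is considered dead with the default setting, while one heard from 10 seconds ago is not.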
***************************
Configure Timeouts Properly
***************************

These timeout values are related. For example, if the transaction timeout is
greater than the server timeout, it won't be fully effective, since the server
will treat the client as dead once the server timeout is reached anyway.
Similarly, it does not make sense to have the transaction timeout exceed the
XGBoost client operation timeout.

In general, follow this rule:

Per-message Timeout < Transaction Timeout < XGBoost Client Operation Timeout
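The ordering rule above can be checked mechanically before submitting a job. The helper below is an illustrative sketch, not part of NVFlare; the parameter names mirror those used in this commit's example configs (``per_msg_timeout``, ``tx_timeout``, ``max_client_op_interval``).

```python
def validate_timeouts(per_msg_timeout: float, tx_timeout: float,
                      max_client_op_interval: float) -> None:
    # Enforce: Per-message Timeout < Transaction Timeout
    #          < XGBoost Client Operation Timeout.
    if not (per_msg_timeout < tx_timeout < max_client_op_interval):
        raise ValueError(
            f"expected per_msg_timeout ({per_msg_timeout}) < "
            f"tx_timeout ({tx_timeout}) < "
            f"max_client_op_interval ({max_client_op_interval})"
        )

# The values used in this commit's example configs satisfy the rule
# against the default 900-second client operation timeout:
validate_timeouts(per_msg_timeout=100, tx_timeout=500,
                  max_client_op_interval=900)
```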

examples/advanced/xgboost/histogram-based/README.md

Lines changed: 18 additions & 18 deletions

@@ -28,27 +28,15 @@ Model accuracy can be visualized in tensorboard:
 tensorboard --logdir /tmp/nvflare/xgboost_v2_workspace/simulate_job/tb_events
 ```
 
-### Run federated experiments in real world
+## Timeout configuration
 
-To run in a federated setting, follow [Real-World FL](https://nvflare.readthedocs.io/en/main/real_world_fl.html) to
-start the overseer, FL servers and FL clients.
-
-You need to download the HIGGS data on each client site.
-You will also need to install the xgboost on each client site and server site.
-
-You can still generate the data splits and job configs using the scripts provided.
-
-You will need to copy the generated data split file into each client site.
-You might also need to modify the `data_path` in the `data_site-XXX.json`
-inside the `/tmp/nvflare/xgboost_higgs_dataset` folder,
-since each site might save the HIGGS dataset in different places.
-
-Then you can use admin client to submit the job via `submit_job` command.
+Please refer to [Reliable Federated XGBoost Timeout Mechanism](https://nvflare.readthedocs.io/en/2.4/user_guide/reliable_xgboost.html)
 
 ## Customization
 
-The provided XGBoost executor can be customized using Boost parameters
-provided in `xgb_params` argument.
+The provided FedXGBHistogramExecutor can be customized by passing
+[xgboost parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)
+in the `xgb_params` argument.
 
 If the parameter change alone is not sufficient and code changes are required,
 a custom executor can be implemented to make calls to xgboost library directly.
@@ -59,13 +47,25 @@ overwrite the `xgb_train()` method.
 To use other dataset, can inherit the base class `XGBDataLoader` and
 implement the `load_data()` method.
 
+## Run in real world
+
+To run in a federated setting, follow [Real-World FL](https://nvflare.readthedocs.io/en/main/real_world_fl.html) to
+start the overseer, FL servers and FL clients.
+
+1. Each participating site need to install xgboost and nvflare.
+2. Each participating site need to have their own data loader
+or use the same dataloader but with different location to load data
+(can refer to higgs_data_loader.py to write one for their own data)
+
+Then you can use admin client to submit the job via `submit_job` command.
+
 ## GPU support
 By default, CPU based training is used.
 
 If the CUDA is installed on the site, tree construction and prediction can be
 accelerated using GPUs.
 
-To enable GPU accelerated training, in `config_fed_client` set the args of
+To enable GPU accelerated training, in `config_fed_client` set the args of
 `FedXGBHistogramExecutor` to `"use_gpus": true` and set `"tree_method": "hist"`
 in `xgb_params`.
examples/advanced/xgboost/histogram-based/jobs/base_v2/app/config/config_fed_client.json

Lines changed: 3 additions & 2 deletions

@@ -1,6 +1,5 @@
 {
   "format_version": 2,
-  "num_rounds": 100,
   "executors": [
     {
       "tasks": [
@@ -20,7 +19,9 @@
         "eval_metric": "auc",
         "tree_method": "hist",
         "nthread": 16
-      }
+      },
+      "per_msg_timeout": 100,
+      "tx_timeout": 500
     }
   }
 }

job_templates/vertical_xgb/config_fed_client.conf

Lines changed: 2 additions & 0 deletions

@@ -24,6 +24,8 @@ executors = [
       use_gpus = false
       metrics_writer_id = "metrics_writer"
       model_file_name = "test.model.json"
+      per_msg_timeout = 100
+      tx_timeout = 500
     }
   }
 }
