
Commit c2c3548

[2.4] Add xgboost metrics tracking cb (#2381)
1 parent 4946ac6 commit c2c3548

File tree

27 files changed: 250 additions, 114 deletions


examples/advanced/random_forest/README.md
Lines changed: 5 additions & 16 deletions

@@ -97,21 +97,10 @@ By default, CPU based training is used.
 If the CUDA is installed on the site, tree construction and prediction can be
 accelerated using GPUs.
 
-GPUs are enabled by using :code:`gpu_hist` as :code:`tree_method` parameter.
-For example,
-::
-    "xgboost_params": {
-        "max_depth": 8,
-        "eta": 0.1,
-        "objective": "binary:logistic",
-        "eval_metric": "auc",
-        "tree_method": "gpu_hist",
-        "gpu_id": 0,
-        "nthread": 16
-    }
-
-For GPU based training, edit `job_config_gen.sh` to change `TREE_METHOD="hist"` to `TREE_METHOD="gpu_hist"`.
-Then run the `job_config_gen.sh` again to generates new job configs for GPU-based training.
+In order to enable GPU accelerated training, first ensure that your machine has CUDA installed and has at least one GPU.
+In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`.
+Then, in `FedXGBTreeExecutor` we use the `device` parameter to map each rank to a GPU device ordinal.
+If using multiple GPUs, we can map each rank to a different GPU device, however you can also map each rank to the same GPU device if using a single GPU.
 
 ## Run experiments
 After you run the two scripts `data_split_gen.sh` and `jobs_gen.sh`, the experiments can be run with the NVFlare simulator.

@@ -162,4 +151,4 @@ AUC over first 1000000 instances is: 0.7828698775310959
 AUC over first 1000000 instances is: 0.779952094937354
 20_clients_square_split_scaled_lr_split_0.02_subsample
 AUC over first 1000000 instances is: 0.7825360505137948
-```
+```
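
The rewritten GPU instructions above rely on XGBoost's newer `device` parameter in place of the removed `gpu_hist` tree method. As a hedged illustration of the rank-to-ordinal mapping they describe (this is not the actual `FedXGBTreeExecutor` code; `rank` and `num_gpus` are hypothetical inputs):

```python
# Hypothetical sketch of the rank-to-GPU mapping described in the README hunk.
def assign_device(xgb_params: dict, rank: int, num_gpus: int) -> dict:
    """Map a client rank to a CUDA device ordinal in xgb_params."""
    if num_gpus > 0:
        # With several GPUs each rank gets its own device; with a single GPU
        # (num_gpus == 1) every rank shares "cuda:0".
        xgb_params["device"] = f"cuda:{rank % num_gpus}"
        xgb_params["tree_method"] = "hist"  # "hist" + "device" supersedes "gpu_hist"
    return xgb_params
```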

examples/advanced/random_forest/jobs_gen.sh
Lines changed: 0 additions & 1 deletion

@@ -1,6 +1,5 @@
 #!/usr/bin/env bash
 
-# change to "gpu_hist" for gpu training
 TREE_METHOD="hist"
 DATA_SPLIT_ROOT="/tmp/nvflare/random_forest/HIGGS/data_splits"

examples/advanced/random_forest/utils/model_validation.py
Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ def model_validation_args_parser():
         help="Total number of trees",
     )
     parser.add_argument(
-        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist or gpu_hist for best perf"
+        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist for best perf"
     )
     return parser

examples/advanced/random_forest/utils/prepare_job_config.py
Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ def job_config_args_parser():
     parser.add_argument("--lr_mode", type=str, default="uniform", help="Whether to use uniform or scaled shrinkage")
     parser.add_argument("--nthread", type=int, default=16, help="nthread for xgboost")
     parser.add_argument(
-        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist or gpu_hist for best perf"
+        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist for best perf"
    )
     return parser

examples/advanced/vertical_xgboost/README.md
Lines changed: 9 additions & 6 deletions

@@ -2,7 +2,7 @@
 This example shows how to use vertical federated learning with [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) on tabular data.
 Here we use the optimized gradient boosting library [XGBoost](https://github.com/dmlc/xgboost) and leverage its federated learning support.
 
-Before starting please make sure you set up a [virtual environment](../../../README.md#set-up-a-virtual-environment) and install the additional requirements:
+Before starting please make sure you set up a [virtual environment](../../README.md#set-up-a-virtual-environment) and install the additional requirements:
 ```
 python3 -m pip install -r requirements.txt
 ```

@@ -30,7 +30,7 @@ Run the following command to prepare the data splits:
 ### Private Set Intersection (PSI)
 Since not every site will have the same set of data samples (rows), we can use PSI to compare encrypted versions of the sites' datasets in order to jointly compute the intersection based on common IDs. In this example, the HIGGS dataset does not contain unique identifiers so we add a temporary `uid_{idx}` to each instance and give each site a portion of the HIGGS dataset that includes a common overlap. Afterwards the identifiers are dropped since they are only used for matching, and training is then done on the intersected data. To learn more about our PSI protocol implementation, see our [psi example](../psi/README.md).
 
-> **_NOTE:_** The uid can be a composition of multiple variabes with a transformation, however in this example we use indices for simplicity. PSI can also be used for computing the intersection of overlapping features, but here we give each site unique features.
+> **_NOTE:_** The uid can be a composition of multiple variables with a transformation, however in this example we use indices for simplicity. PSI can also be used for computing the intersection of overlapping features, but here we give each site unique features.
 
 Create the psi job using the predefined psi_csv template:
 ```

@@ -58,7 +58,9 @@ Lastly, we must subclass `XGBDataLoader` and implement the `load_data()` method.
 By default, CPU based training is used.
 
 In order to enable GPU accelerated training, first ensure that your machine has CUDA installed and has at least one GPU.
-In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"` in `xgb_params`. Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`. If using multiple GPUs, we can map each rank to a different GPU device, however you can also map each rank to the same GPU device if using a single GPU.
+In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"` in `xgb_params`.
+Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`.
+If using multiple GPUs, we can map each rank to a different GPU device, however you can also map each rank to the same GPU device if using a single GPU.
 
 We can create a GPU enabled job using the job CLI:
 ```

@@ -87,10 +89,11 @@ The model will be saved to `test.model.json`.
 ## Results
 Model accuracy can be visualized in tensorboard:
 ```
-tensorboard --logdir /tmp/nvflare/vertical_xgb
+tensorboard --logdir /tmp/nvflare/vertical_xgb/simulate_job/tb_events
 ```
 
-An example training (pink) and validation (orange) AUC graph from running vertical XGBoost on HIGGS.
-Used an intersection of 50000 samples across 5 clients each with different features, and ran for ~50 rounds due to early stopping.
+An example training (pink) and validation (orange) AUC graph from running vertical XGBoost on HIGGS:
+(Used an intersection of 50000 samples across 5 clients each with different features,
+and ran for ~50 rounds due to early stopping.)
 
 ![Vertical XGBoost graph](./figs/vertical_xgboost_graph.png)
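
The hunk header above references subclassing `XGBDataLoader` and implementing `load_data()`. A minimal sketch of that pattern, assuming a per-site CSV whose first column is the label; the import path and the `load_data()` signature should be checked against the NVFlare version in use:

```python
import pandas as pd
import xgboost as xgb

from nvflare.app_opt.xgboost.data_loader import XGBDataLoader


class CsvDataLoader(XGBDataLoader):
    """Hypothetical loader: one CSV per site, label in column 0."""

    def __init__(self, data_path: str, valid_frac: float = 0.1):
        self.data_path = data_path
        self.valid_frac = valid_frac

    def load_data(self, client_id: str):
        df = pd.read_csv(self.data_path, header=None)
        # Hold out the first valid_frac of rows for validation.
        n_valid = int(len(df) * self.valid_frac)
        valid, train = df.iloc[:n_valid], df.iloc[n_valid:]
        dmat_train = xgb.DMatrix(train.iloc[:, 1:], label=train.iloc[:, 0])
        dmat_valid = xgb.DMatrix(valid.iloc[:, 1:], label=valid.iloc[:, 0])
        return dmat_train, dmat_valid
```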

examples/advanced/xgboost/README.md
Lines changed: 3 additions & 1 deletion

@@ -139,7 +139,9 @@ By default, CPU based training is used.
 If the CUDA is installed on the site, tree construction and prediction can be
 accelerated using GPUs.
 
-To enable GPU accelerated training, in `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`. Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`. For a single GPU, assuming it has enough memory, we can map each rank to the same device with `params["device"] = f"cuda:0"`.
+To enable GPU accelerated training, in `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`.
+Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`.
+For a single GPU, assuming it has enough memory, we can map each rank to the same device with `params["device"] = f"cuda:0"`.
 
 ### Multi GPU support

examples/advanced/xgboost/histogram-based/README.md
Lines changed: 27 additions & 4 deletions

@@ -11,15 +11,21 @@ Switch to this directory and install additional requirements (suggest to do this
 python3 -m pip install -r requirements.txt
 ```
 
+### Run centralized experiments
+```
+bash run_experiment_centralized.sh
+```
+
 ### Run federated experiments with simulator locally
 Next, we will use the NVFlare simulator to run FL training automatically.
 ```
-bash run_experiment_simulator.sh
+nvflare simulator jobs/higgs_2_histogram_v2_uniform_split_uniform_lr \
+    -w /tmp/nvflare/xgboost_v2_workspace -n 2 -t 2
 ```
 
-### Run centralized experiments
+Model accuracy can be visualized in tensorboard:
 ```
-bash run_experiment_centralized.sh
+tensorboard --logdir /tmp/nvflare/xgboost_v2_workspace/simulate_job/tb_events
 ```
 
 ### Run federated experiments in real world

@@ -51,4 +57,21 @@ The custom executor can inherit the base class `FedXGBHistogramExecutor` and
 overwrite the `xgb_train()` method.
 
 To use other dataset, can inherit the base class `XGBDataLoader` and
-implement that `load_data()` method.
+implement the `load_data()` method.
+
+## Loose integration
+
+We can use the NVFlare controller/executor just to launch the external xgboost
+federated server and client.
+
+### Run federated experiments with simulator locally
+Next, we will use the NVFlare simulator to run FL training automatically.
+```
+nvflare simulator jobs/higgs_2_histogram_uniform_split_uniform_lr \
+    -w /tmp/nvflare/xgboost_workspace -n 2 -t 2
+```
+
+Model accuracy can be visualized in tensorboard:
+```
+tensorboard --logdir /tmp/nvflare/xgboost_workspace/simulate_job/tb_events
+```
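
The hunk above names the customization point: inherit `FedXGBHistogramExecutor` and overwrite `xgb_train()`. A hedged sketch of such an override; the `params` field names and the `self.train_data`/`self.val_data` attributes are assumptions to verify against the base class source for your NVFlare version:

```python
import xgboost as xgb

from nvflare.app_opt.xgboost.histogram_based.executor import FedXGBHistogramExecutor


class CustomXGBExecutor(FedXGBHistogramExecutor):
    def xgb_train(self, params):
        # Same shape as the base training loop, but the watchlist also
        # reports the eval metric on the training split.
        watchlist = [(self.val_data, "eval"), (self.train_data, "train")]
        return xgb.train(
            params.xgb_params,
            self.train_data,
            num_boost_round=params.num_rounds,
            evals=watchlist,
            early_stopping_rounds=params.early_stopping_rounds,
        )
```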

examples/advanced/xgboost/histogram-based/jobs/base/app/config/config_fed_client.json
Lines changed: 11 additions & 0 deletions

@@ -13,6 +13,7 @@
       "data_loader_id": "dataloader",
       "num_rounds": "{num_rounds}",
       "early_stopping_rounds": 2,
+      "metrics_writer_id": "metrics_writer",
       "xgb_params": {
         "max_depth": 8,
         "eta": 0.1,

@@ -34,6 +35,16 @@
       "args": {
         "data_split_filename": "data_split.json"
       }
+    },
+    {
+      "id": "metrics_writer",
+      "path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
+      "args": {"event_type": "analytix_log_stats"}
+    },
+    {
+      "id": "event_to_fed",
+      "name": "ConvertToFedEvent",
+      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
     }
   ]
 }
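
These components wire up the metrics path named in the commit title: `TBWriter` fires local `analytix_log_stats` events, `ConvertToFedEvent` re-emits them as `fed.` events, and the server-side receiver (added in `config_fed_server.json` below) aggregates them into TensorBoard files. A hedged sketch of the kind of xgboost callback that could feed the writer referenced by `metrics_writer_id`; the class name and the writer's `add_scalar()` call are assumptions, not the exact NVFlare 2.4 code:

```python
from xgboost.callback import TrainingCallback


class MetricsTrackingCallback(TrainingCallback):
    """Hypothetical callback forwarding per-round eval metrics to a writer."""

    def __init__(self, writer):
        self.writer = writer  # e.g. the component registered as "metrics_writer"

    def after_iteration(self, model, epoch, evals_log):
        # evals_log looks like {"eval": {"auc": [0.71, 0.73, ...]}, ...}
        for data_name, metrics in evals_log.items():
            for metric_name, values in metrics.items():
                self.writer.add_scalar(f"{data_name}/{metric_name}", values[-1], epoch)
        return False  # False means "do not stop training"
```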

examples/advanced/xgboost/histogram-based/jobs/base/app/config/config_fed_server.json
Lines changed: 9 additions & 1 deletion

@@ -2,7 +2,15 @@
   "format_version": 2,
   "task_data_filters": [],
   "task_result_filters": [],
-  "components": [],
+  "components": [
+    {
+      "id": "tb_receiver",
+      "path": "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver",
+      "args": {
+        "tb_folder": "tb_events"
+      }
+    }
+  ],
   "workflows": [
     {
       "id": "xgb_controller",

examples/advanced/xgboost/histogram-based/jobs/base_v2/app/config/config_fed_client.json
Lines changed: 11 additions & 0 deletions

@@ -11,6 +11,7 @@
       "path": "nvflare.app_opt.xgboost.histogram_based_v2.executor.FedXGBHistogramExecutor",
       "args": {
         "data_loader_id": "dataloader",
+        "metrics_writer_id": "metrics_writer",
         "early_stopping_rounds": 2,
         "xgb_params": {
           "max_depth": 8,

@@ -33,6 +34,16 @@
       "args": {
         "data_split_filename": "data_split.json"
       }
+    },
+    {
+      "id": "metrics_writer",
+      "path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
+      "args": {"event_type": "analytix_log_stats"}
+    },
+    {
+      "id": "event_to_fed",
+      "name": "ConvertToFedEvent",
+      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
     }
   ]
 }
