
Commit c2c3548

[2.4] Add xgboost metrics tracking cb (#2381)
1 parent 4946ac6 commit c2c3548

File tree

27 files changed: 250 additions, 114 deletions


examples/advanced/random_forest/README.md
Lines changed: 5 additions & 16 deletions

@@ -97,21 +97,10 @@ By default, CPU based training is used.
 If the CUDA is installed on the site, tree construction and prediction can be
 accelerated using GPUs.
 
-GPUs are enabled by using :code:`gpu_hist` as :code:`tree_method` parameter.
-For example,
-::
-    "xgboost_params": {
-        "max_depth": 8,
-        "eta": 0.1,
-        "objective": "binary:logistic",
-        "eval_metric": "auc",
-        "tree_method": "gpu_hist",
-        "gpu_id": 0,
-        "nthread": 16
-    }
-
-For GPU based training, edit `job_config_gen.sh` to change `TREE_METHOD="hist"` to `TREE_METHOD="gpu_hist"`.
-Then run the `job_config_gen.sh` again to generates new job configs for GPU-based training.
+In order to enable GPU accelerated training, first ensure that your machine has CUDA installed and has at least one GPU.
+In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`.
+Then, in `FedXGBTreeExecutor` we use the `device` parameter to map each rank to a GPU device ordinal.
+If using multiple GPUs, we can map each rank to a different GPU device, however you can also map each rank to the same GPU device if using a single GPU.
 
 ## Run experiments
 After you run the two scripts `data_split_gen.sh` and `jobs_gen.sh`, the experiments can be run with the NVFlare simulator.

@@ -162,4 +151,4 @@ AUC over first 1000000 instances is: 0.7828698775310959
 AUC over first 1000000 instances is: 0.779952094937354
 20_clients_square_split_scaled_lr_split_0.02_subsample
 AUC over first 1000000 instances is: 0.7825360505137948
-```
+```
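
The rewritten GPU instructions above rely on XGBoost's newer `device` parameter in place of the removed `gpu_hist` tree method. As a hedged illustration of the rank-to-ordinal mapping they describe (this is not the actual `FedXGBTreeExecutor` code; `rank` and `num_gpus` are hypothetical inputs):

```python
# Hypothetical sketch of the rank-to-GPU mapping described in the README hunk.
def assign_device(xgb_params: dict, rank: int, num_gpus: int) -> dict:
    """Map a client rank to a CUDA device ordinal in xgb_params."""
    if num_gpus > 0:
        # With several GPUs each rank gets its own device; with a single GPU
        # (num_gpus == 1) every rank shares "cuda:0".
        xgb_params["device"] = f"cuda:{rank % num_gpus}"
        xgb_params["tree_method"] = "hist"  # "hist" + "device" supersedes "gpu_hist"
    return xgb_params
```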

examples/advanced/random_forest/jobs_gen.sh
Lines changed: 0 additions & 1 deletion

@@ -1,6 +1,5 @@
 #!/usr/bin/env bash
 
-# change to "gpu_hist" for gpu training
 TREE_METHOD="hist"
 DATA_SPLIT_ROOT="/tmp/nvflare/random_forest/HIGGS/data_splits"

examples/advanced/random_forest/utils/model_validation.py
Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ def model_validation_args_parser():
         help="Total number of trees",
     )
     parser.add_argument(
-        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist or gpu_hist for best perf"
+        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist for best perf"
     )
     return parser

examples/advanced/random_forest/utils/prepare_job_config.py
Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ def job_config_args_parser():
     parser.add_argument("--lr_mode", type=str, default="uniform", help="Whether to use uniform or scaled shrinkage")
     parser.add_argument("--nthread", type=int, default=16, help="nthread for xgboost")
     parser.add_argument(
-        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist or gpu_hist for best perf"
+        "--tree_method", type=str, default="hist", help="tree_method for xgboost - use hist for best perf"
    )
     return parser

examples/advanced/vertical_xgboost/README.md
Lines changed: 9 additions & 6 deletions

@@ -2,7 +2,7 @@
 This example shows how to use vertical federated learning with [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) on tabular data.
 Here we use the optimized gradient boosting library [XGBoost](https://github.com/dmlc/xgboost) and leverage its federated learning support.
 
-Before starting please make sure you set up a [virtual environment](../../../README.md#set-up-a-virtual-environment) and install the additional requirements:
+Before starting please make sure you set up a [virtual environment](../../README.md#set-up-a-virtual-environment) and install the additional requirements:
 ```
 python3 -m pip install -r requirements.txt
 ```

@@ -30,7 +30,7 @@ Run the following command to prepare the data splits:
 ### Private Set Intersection (PSI)
 Since not every site will have the same set of data samples (rows), we can use PSI to compare encrypted versions of the sites' datasets in order to jointly compute the intersection based on common IDs. In this example, the HIGGS dataset does not contain unique identifiers so we add a temporary `uid_{idx}` to each instance and give each site a portion of the HIGGS dataset that includes a common overlap. Afterwards the identifiers are dropped since they are only used for matching, and training is then done on the intersected data. To learn more about our PSI protocol implementation, see our [psi example](../psi/README.md).
 
-> **_NOTE:_** The uid can be a composition of multiple variabes with a transformation, however in this example we use indices for simplicity. PSI can also be used for computing the intersection of overlapping features, but here we give each site unique features.
+> **_NOTE:_** The uid can be a composition of multiple variables with a transformation, however in this example we use indices for simplicity. PSI can also be used for computing the intersection of overlapping features, but here we give each site unique features.
 
 Create the psi job using the predefined psi_csv template:
 ```

@@ -58,7 +58,9 @@ Lastly, we must subclass `XGBDataLoader` and implement the `load_data()` method.
 By default, CPU based training is used.
 
 In order to enable GPU accelerated training, first ensure that your machine has CUDA installed and has at least one GPU.
-In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"` in `xgb_params`. Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`. If using multiple GPUs, we can map each rank to a different GPU device, however you can also map each rank to the same GPU device if using a single GPU.
+In `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"` in `xgb_params`.
+Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`.
+If using multiple GPUs, we can map each rank to a different GPU device, however you can also map each rank to the same GPU device if using a single GPU.
 
 We can create a GPU enabled job using the job CLI:
 ```

@@ -87,10 +89,11 @@ The model will be saved to `test.model.json`.
 ## Results
 Model accuracy can be visualized in tensorboard:
 ```
-tensorboard --logdir /tmp/nvflare/vertical_xgb
+tensorboard --logdir /tmp/nvflare/vertical_xgb/simulate_job/tb_events
 ```
 
-An example training (pink) and validation (orange) AUC graph from running vertical XGBoost on HIGGS.
-Used an intersection of 50000 samples across 5 clients each with different features, and ran for ~50 rounds due to early stopping.
+An example training (pink) and validation (orange) AUC graph from running vertical XGBoost on HIGGS:
+(Used an intersection of 50000 samples across 5 clients each with different features,
+and ran for ~50 rounds due to early stopping.)
 
 ![Vertical XGBoost graph](./figs/vertical_xgboost_graph.png)
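
The hunk header above references subclassing `XGBDataLoader` and implementing `load_data()`. A minimal sketch of that pattern, assuming a per-site CSV whose first column is the label; the import path and the `load_data()` signature should be checked against the NVFlare version in use:

```python
import pandas as pd
import xgboost as xgb

from nvflare.app_opt.xgboost.data_loader import XGBDataLoader


class CsvDataLoader(XGBDataLoader):
    """Hypothetical loader: one CSV per site, label in column 0."""

    def __init__(self, data_path: str, valid_frac: float = 0.1):
        self.data_path = data_path
        self.valid_frac = valid_frac

    def load_data(self, client_id: str):
        df = pd.read_csv(self.data_path, header=None)
        # Hold out the first valid_frac of rows for validation.
        n_valid = int(len(df) * self.valid_frac)
        valid, train = df.iloc[:n_valid], df.iloc[n_valid:]
        dmat_train = xgb.DMatrix(train.iloc[:, 1:], label=train.iloc[:, 0])
        dmat_valid = xgb.DMatrix(valid.iloc[:, 1:], label=valid.iloc[:, 0])
        return dmat_train, dmat_valid
```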

examples/advanced/xgboost/README.md
Lines changed: 3 additions & 1 deletion

@@ -139,7 +139,9 @@ By default, CPU based training is used.
 If the CUDA is installed on the site, tree construction and prediction can be
 accelerated using GPUs.
 
-To enable GPU accelerated training, in `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`. Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`. For a single GPU, assuming it has enough memory, we can map each rank to the same device with `params["device"] = f"cuda:0"`.
+To enable GPU accelerated training, in `config_fed_client.json` set `"use_gpus": true` and `"tree_method": "hist"`.
+Then, in `FedXGBHistogramExecutor` we use the `device` parameter to map each rank to a GPU device ordinal in `xgb_params`.
+For a single GPU, assuming it has enough memory, we can map each rank to the same device with `params["device"] = f"cuda:0"`.
 
 ### Multi GPU support

examples/advanced/xgboost/histogram-based/README.md
Lines changed: 27 additions & 4 deletions

@@ -11,15 +11,21 @@ Switch to this directory and install additional requirements (suggest to do this
 python3 -m pip install -r requirements.txt
 ```
 
+### Run centralized experiments
+```
+bash run_experiment_centralized.sh
+```
+
 ### Run federated experiments with simulator locally
 Next, we will use the NVFlare simulator to run FL training automatically.
 ```
-bash run_experiment_simulator.sh
+nvflare simulator jobs/higgs_2_histogram_v2_uniform_split_uniform_lr \
+    -w /tmp/nvflare/xgboost_v2_workspace -n 2 -t 2
 ```
 
-### Run centralized experiments
+Model accuracy can be visualized in tensorboard:
 ```
-bash run_experiment_centralized.sh
+tensorboard --logdir /tmp/nvflare/xgboost_v2_workspace/simulate_job/tb_events
 ```
 
 ### Run federated experiments in real world

@@ -51,4 +57,21 @@ The custom executor can inherit the base class `FedXGBHistogramExecutor` and
 overwrite the `xgb_train()` method.
 
 To use other dataset, can inherit the base class `XGBDataLoader` and
-implement that `load_data()` method.
+implement the `load_data()` method.
+
+## Loose integration
+
+We can use the NVFlare controller/executor just to launch the external xgboost
+federated server and client.
+
+### Run federated experiments with simulator locally
+Next, we will use the NVFlare simulator to run FL training automatically.
+```
+nvflare simulator jobs/higgs_2_histogram_uniform_split_uniform_lr \
+    -w /tmp/nvflare/xgboost_workspace -n 2 -t 2
+```
+
+Model accuracy can be visualized in tensorboard:
+```
+tensorboard --logdir /tmp/nvflare/xgboost_workspace/simulate_job/tb_events
+```
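
The hunk above names the customization point: inherit `FedXGBHistogramExecutor` and overwrite `xgb_train()`. A hedged sketch of such an override; the `params` field names and the `self.train_data`/`self.val_data` attributes are assumptions to verify against the base class source for your NVFlare version:

```python
import xgboost as xgb

from nvflare.app_opt.xgboost.histogram_based.executor import FedXGBHistogramExecutor


class CustomXGBExecutor(FedXGBHistogramExecutor):
    def xgb_train(self, params):
        # Same shape as the base training loop, but the watchlist also
        # reports the eval metric on the training split.
        watchlist = [(self.val_data, "eval"), (self.train_data, "train")]
        return xgb.train(
            params.xgb_params,
            self.train_data,
            num_boost_round=params.num_rounds,
            evals=watchlist,
            early_stopping_rounds=params.early_stopping_rounds,
        )
```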

examples/advanced/xgboost/histogram-based/jobs/base/app/config/config_fed_client.json
Lines changed: 11 additions & 0 deletions

@@ -13,6 +13,7 @@
       "data_loader_id": "dataloader",
       "num_rounds": "{num_rounds}",
       "early_stopping_rounds": 2,
+      "metrics_writer_id": "metrics_writer",
       "xgb_params": {
         "max_depth": 8,
         "eta": 0.1,

@@ -34,6 +35,16 @@
       "args": {
         "data_split_filename": "data_split.json"
       }
+    },
+    {
+      "id": "metrics_writer",
+      "path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
+      "args": {"event_type": "analytix_log_stats"}
+    },
+    {
+      "id": "event_to_fed",
+      "name": "ConvertToFedEvent",
+      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
     }
   ]
 }
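
These components wire up the metrics path named in the commit title: `TBWriter` fires local `analytix_log_stats` events, `ConvertToFedEvent` re-emits them as `fed.` events, and the server-side receiver (added in `config_fed_server.json` below) aggregates them into TensorBoard files. A hedged sketch of the kind of xgboost callback that could feed the writer referenced by `metrics_writer_id`; the class name and the writer's `add_scalar()` call are assumptions, not the exact NVFlare 2.4 code:

```python
from xgboost.callback import TrainingCallback


class MetricsTrackingCallback(TrainingCallback):
    """Hypothetical callback forwarding per-round eval metrics to a writer."""

    def __init__(self, writer):
        self.writer = writer  # e.g. the component registered as "metrics_writer"

    def after_iteration(self, model, epoch, evals_log):
        # evals_log looks like {"eval": {"auc": [0.71, 0.73, ...]}, ...}
        for data_name, metrics in evals_log.items():
            for metric_name, values in metrics.items():
                self.writer.add_scalar(f"{data_name}/{metric_name}", values[-1], epoch)
        return False  # False means "do not stop training"
```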

examples/advanced/xgboost/histogram-based/jobs/base/app/config/config_fed_server.json
Lines changed: 9 additions & 1 deletion

@@ -2,7 +2,15 @@
   "format_version": 2,
   "task_data_filters": [],
   "task_result_filters": [],
-  "components": [],
+  "components": [
+    {
+      "id": "tb_receiver",
+      "path": "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver",
+      "args": {
+        "tb_folder": "tb_events"
+      }
+    }
+  ],
   "workflows": [
     {
       "id": "xgb_controller",

examples/advanced/xgboost/histogram-based/jobs/base_v2/app/config/config_fed_client.json
Lines changed: 11 additions & 0 deletions

@@ -11,6 +11,7 @@
       "path": "nvflare.app_opt.xgboost.histogram_based_v2.executor.FedXGBHistogramExecutor",
       "args": {
         "data_loader_id": "dataloader",
+        "metrics_writer_id": "metrics_writer",
         "early_stopping_rounds": 2,
         "xgb_params": {
           "max_depth": 8,

@@ -33,6 +34,16 @@
       "args": {
         "data_split_filename": "data_split.json"
       }
+    },
+    {
+      "id": "metrics_writer",
+      "path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
+      "args": {"event_type": "analytix_log_stats"}
+    },
+    {
+      "id": "event_to_fed",
+      "name": "ConvertToFedEvent",
+      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
     }
   ]
 }
