
Commit 9ca4e4f

Add financial example with xgboost (#2054)
* Add financial examples, update xgboost to account for xgboost's API change
* add readme
* format fix
* format fix
* print message fix
* change vertical config settings to match horizontal and update readme
1 parent ad9245e commit 9ca4e4f

38 files changed: +1736 −15 lines changed

38 files changed

+1736
-15
lines changed
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
# Financial Application with Federated XGBoost Methods

This example illustrates the use of [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) on a financial application.
It shows how to use [XGBoost](https://github.com/dmlc/xgboost) in various ways to train a model in a federated manner to perform fraud detection with a
[finance dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).

## Federated Training of XGBoost

Several mechanisms have been proposed for training an XGBoost model in a federated learning setting.
In these examples, we illustrate the use of NVFlare to carry out the following four approaches:
- *vertical* federated learning using histogram-based collaboration
- *horizontal* federated learning using three approaches:
  - histogram-based collaboration
  - tree-based collaboration with cyclic federation
  - tree-based collaboration with bagging federation

For more details, please refer to the READMEs for the
[vertical](https://github.com/NVIDIA/NVFlare/blob/main/examples/advanced/vertical_xgboost/README.md),
[histogram-based](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/xgboost/histogram-based/README.md),
and [tree-based](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/xgboost/tree-based/README.md)
methods.

## Data Preparation

### Download and Store Data

To run the examples, first download the dataset from the link above; it is a single `.csv` file.
By default, we assume the dataset is downloaded, uncompressed, and stored in `${PWD}/dataset/creditcard.csv`.

> **_NOTE:_** If the dataset is downloaded to another location,
> make sure to modify the corresponding `DATASET_PATH` inside `prepare_data.sh`.

### Data Split

We first split the dataset into two parts, training and testing, and then generate the per-client data splits under both horizontal and vertical settings.

The data splits used in this example can be generated with
```
bash prepare_data.sh
```

This will generate data splits for 2 clients under all experimental settings. Note that the overlap ratio between clients in the vertical setting is 1.0 by default, so that the amount of training data matches the horizontal experiments.
If you want to customize your experiments to simulate more realistic scenarios, please check the corresponding scripts under `utils/`.

> **_NOTE:_** The generated data files will be stored in the folder `/tmp/dataset/`,
> and will be used by the jobs by specifying the path within `config_fed_client.json`, as shown below.
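For reference, the horizontal client configuration in this commit (shown further below) wires its data loader to one of the generated split files:
```json
{
  "id": "dataloader",
  "path": "data_loader.DataLoader",
  "args": {
    "data_split_filename": "/tmp/dataset/horizontal_xgb_data/data_site-1.json"
  }
}
```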

## Run experiments for all settings

To run all experiments, we provide a script covering all settings.
```
bash run_training.sh
```
This covers baseline centralized training; horizontal FL with histogram-based, tree-based cyclic, and tree-based bagging collaborations; and vertical FL.
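If you want to launch a single job directly instead, NVFlare's simulator can run one job folder at a time. The job folder name below is an assumption about this example's layout, since the script's actual job names are not shown in this diff:
```
nvflare simulator ./jobs/xgboost_horizontal_bagging -w /tmp/nvflare/workspace -n 2 -t 2
```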

Then we test the resulting models on the test dataset with
```
bash run_testing.sh
```
The results are as follows:
```
Testing baseline_xgboost
AUC score: 0.965017768854869
Testing xgboost_vertical
AUC score: 0.9650650531737737
Testing xgboost_horizontal_histogram
AUC score: 0.9579533839422094
Testing xgboost_horizontal_cyclic
AUC score: 0.9688269828190139
Testing xgboost_horizontal_bagging
AUC score: 0.9713936151275366
```
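As a minimal sketch of what each testing step does, the snippet below loads one saved model and scores the held-out test set. It assumes the conventions used elsewhere in this example (label in the first CSV column, model saved as JSON); the exact paths `run_testing.sh` uses are not shown in this diff, so the ones here are illustrative.
```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Load the test split; the label is the first column, features follow.
df = pd.read_csv("./dataset/test.csv")
dmat_test = xgb.DMatrix(df.iloc[:, 1:], label=df.iloc[:, 0])

# Load a trained model saved by one of the training jobs (path is illustrative).
bst = xgb.Booster()
bst.load_model("./workspaces/xgboost_workspace_centralized/model_centralized.json")

# Under binary:logistic, predict() returns fraud probabilities.
y_pred = bst.predict(dmat_test)
print(f"AUC score: {roc_auc_score(df.iloc[:, 0], y_pred)}")
```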
Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import time

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split


def xgboost_args_parser():
    parser = argparse.ArgumentParser(description="Centralized XGBoost training with random forest option")
    parser.add_argument(
        "--train_data_path",
        type=str,
        default="./dataset/train.csv",
        help="path to the training dataset file",
    )
    parser.add_argument(
        "--test_data_path",
        type=str,
        default="./dataset/test.csv",
        help="path to the testing dataset file",
    )
    parser.add_argument("--valid_ratio", type=float, default=0.1, help="ratio of validation split")
    parser.add_argument("--num_rounds", type=int, default=100, help="number of boosting rounds")
    parser.add_argument("--num_parallel_tree", type=int, default=1, help="number of parallel trees")
    parser.add_argument(
        "--output_folder",
        type=str,
        default="./workspaces/xgboost_workspace_centralized",
        help="model output folder",
    )
    return parser


def prepare_data(data_path: str):
    df = pd.read_csv(data_path)
    print(df.info())
    print(df.head())
    total_data_num = df.shape[0]
    print(f"Total data count: {total_data_num}")
    # Split into features and label (the label is the first column)
    X = df.iloc[:, 1:]
    y = df.iloc[:, 0]
    print(y.value_counts())
    return X, y


def get_training_parameters(args):
    # use logistic regression loss for binary classification
    # use AUC as the evaluation metric
    param = {
        "objective": "binary:logistic",
        "eta": 0.1,
        "max_depth": 8,
        "eval_metric": "auc",
        "nthread": 16,
        "num_parallel_tree": args.num_parallel_tree,
    }
    return param


def main():
    parser = xgboost_args_parser()
    args = parser.parse_args()

    train_data_path = args.train_data_path
    valid_ratio = args.valid_ratio
    num_rounds = args.num_rounds
    output_folder = args.output_folder

    # Set model file path
    model_path = os.path.join(output_folder, "model_centralized.json")

    # Load data
    start = time.time()
    X, y = prepare_data(train_data_path)

    # Split into training and validation
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=valid_ratio, random_state=77)
    print(
        f"TRAINING: X_train: {X_train.shape}, y_train: {y_train.shape}, fraudulent transactions: {y_train.value_counts()[1]}"
    )
    print(
        f"VALIDATION: X_valid: {X_valid.shape}, y_valid: {y_valid.shape}, fraudulent transactions: {y_valid.value_counts()[1]}"
    )

    # Construct xgboost DMatrix
    dmat_train = xgb.DMatrix(X_train, label=y_train)
    dmat_valid = xgb.DMatrix(X_valid, label=y_valid)

    end = time.time()
    elapsed_time = end - start
    print(f"Data loading time: {elapsed_time}")

    # xgboost training
    start = time.time()
    xgb_params = get_training_parameters(args)
    bst = xgb.train(
        xgb_params,
        dmat_train,
        num_boost_round=num_rounds,
        evals=[(dmat_valid, "validate"), (dmat_train, "train")],
    )
    end = time.time()
    elapsed_time = end - start
    print(f"Training time: {elapsed_time}")

    # Save model
    os.makedirs(output_folder, exist_ok=True)
    bst.save_model(model_path)


if __name__ == "__main__":
    main()
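A note on the "random forest option" in the parser description: in XGBoost, `num_parallel_tree` grows that many trees per boosting round, so combining a large `num_parallel_tree` with a single round trains a classic random forest rather than a boosted ensemble. A hypothetical invocation (the script's file name is not shown in this diff, so `train_centralized.py` is an assumption):
```
python3 train_centralized.py --num_parallel_tree 100 --num_rounds 1
```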
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
{
  "format_version": 2,
  "server": {
    "heart_beat_timeout": 600,
    "task_request_interval": 0.05
  },
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "persistor",
      "path": "nvflare.app_opt.xgboost.tree_based.model_persistor.XGBModelPersistor",
      "args": {
        "save_name": "xgboost_model.json"
      }
    },
    {
      "id": "shareable_generator",
      "path": "nvflare.app_opt.xgboost.tree_based.shareable_generator.XGBModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "path": "nvflare.app_opt.xgboost.tree_based.bagging_aggregator.XGBBaggingAggregator",
      "args": {}
    }
  ],
  "workflows": [
    {
      "id": "scatter_and_gather",
      "name": "ScatterAndGather",
      "args": {
        "min_clients": 2,
        "num_rounds": 100,
        "start_round": 0,
        "wait_time_after_min_received": 0,
        "aggregator_id": "aggregator",
        "persistor_id": "persistor",
        "shareable_generator_id": "shareable_generator",
        "train_task_name": "train",
        "train_timeout": 0,
        "allow_empty_global_weights": true,
        "task_check_period": 0.01,
        "persist_every_n_rounds": 0,
        "snapshot_every_n_rounds": 0
      }
    }
  ]
}
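For orientation, the workflow above repeats a scatter/gather cycle `num_rounds` times: broadcast the global model, wait for at least `min_clients` results, aggregate, and update. The toy sketch below mirrors that control flow for the tree-based bagging case; it is illustrative pseudologic, not NVFlare's implementation, and `client_train` is a hypothetical stand-in for `FedXGBTreeExecutor`.
```python
def client_train(client_id, global_trees, round_num):
    # Stand-in for FedXGBTreeExecutor: boost one new tree on top of the
    # received global model and return it as the client's result.
    return f"tree(client={client_id}, round={round_num})"


def scatter_and_gather(num_rounds=3, min_clients=2):
    global_trees = []  # empty initial model (cf. allow_empty_global_weights)
    for round_num in range(num_rounds):
        # Scatter: send the current global model to every client; gather results.
        results = [client_train(cid, global_trees, round_num) for cid in ("site-1", "site-2")]
        assert len(results) >= min_clients
        # Bagging aggregation: append each client's new tree to the global model.
        global_trees.extend(results)
    return global_trees


print(scatter_and_gather())
```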
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
{
  "format_version": 2,
  "executors": [
    {
      "tasks": [
        "train"
      ],
      "executor": {
        "id": "Executor",
        "name": "FedXGBTreeExecutor",
        "args": {
          "data_loader_id": "dataloader",
          "training_mode": "bagging",
          "num_client_bagging": 2,
          "num_local_parallel_tree": 1,
          "local_subsample": 1,
          "lr_mode": "uniform",
          "local_model_path": "model.json",
          "global_model_path": "model_global.json",
          "learning_rate": 0.1,
          "objective": "binary:logistic",
          "max_depth": 8,
          "eval_metric": "auc",
          "tree_method": "hist",
          "nthread": 16,
          "lr_scale": 0.49999756170115234
        }
      }
    }
  ],
  "task_result_filters": [],
  "task_data_filters": [],
  "components": [
    {
      "id": "dataloader",
      "path": "data_loader.DataLoader",
      "args": {
        "data_split_filename": "/tmp/dataset/horizontal_xgb_data/data_site-1.json"
      }
    }
  ]
}
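The `data_split_filename` above points at a JSON file produced by `prepare_data.sh`. Its exact contents are not part of this commit, but the data loader below reads a `data_path` plus per-site and validation row ranges, so a compatible file can be written like this (the row numbers are illustrative assumptions):
```python
import json

split = {
    "data_path": "/tmp/dataset/horizontal_xgb_data/data.csv",  # assumed CSV location
    "data_index": {
        "valid": {"start": 0, "end": 10000},        # shared validation rows
        "site-1": {"start": 10000, "end": 120000},  # site-1 training rows
        "site-2": {"start": 120000, "end": 230000},  # site-2 training rows
    },
}
with open("/tmp/dataset/horizontal_xgb_data/data_site-1.json", "w") as f:
    json.dump(split, f, indent=2)
```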
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json

import pandas as pd
import xgboost as xgb

from nvflare.app_opt.xgboost.data_loader import XGBDataLoader


def _read_with_pandas(data_path, start: int, end: int):
    data_size = end - start
    # Skip rows belonging to other sites but keep the header row
    data = pd.read_csv(data_path, skiprows=range(1, start), nrows=data_size)
    data_num = data.shape[0]
    # Split into features and label (the label is the first column)
    x = data.iloc[:, 1:].copy()
    y = data.iloc[:, 0].copy()

    return x, y, data_num


class DataLoader(XGBDataLoader):
    def __init__(self, data_split_filename):
        """Reads the dataset and returns XGBoost data matrices.

        Args:
            data_split_filename: file name of the data split JSON
        """
        self.data_split_filename = data_split_filename

    def load_data(self, client_id: str):
        with open(self.data_split_filename, "r") as file:
            data_split = json.load(file)

        data_path = data_split["data_path"]
        data_index = data_split["data_index"]

        # Check that client_id and "valid" are both in the mapping dict
        if client_id not in data_index.keys():
            raise ValueError(
                f"Data does not contain Client {client_id} split",
            )

        if "valid" not in data_index.keys():
            raise ValueError(
                "Data does not contain Validation split",
            )

        site_index = data_index[client_id]
        valid_index = data_index["valid"]

        # Training set
        x_train, y_train, total_train_data_num = _read_with_pandas(
            data_path=data_path, start=site_index["start"], end=site_index["end"]
        )
        dmat_train = xgb.DMatrix(x_train, label=y_train)

        # Validation set
        x_valid, y_valid, total_valid_data_num = _read_with_pandas(
            data_path=data_path, start=valid_index["start"], end=valid_index["end"]
        )
        dmat_valid = xgb.DMatrix(x_valid, label=y_valid)

        return dmat_train, dmat_valid
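In a federated run, NVFlare instantiates this loader from `config_fed_client.json` and calls `load_data` with the site name. Outside of NVFlare it can be exercised directly as a sanity check, assuming the split file generated by `prepare_data.sh` exists:
```python
# Hypothetical standalone check; normally NVFlare drives this loader.
loader = DataLoader(data_split_filename="/tmp/dataset/horizontal_xgb_data/data_site-1.json")
dmat_train, dmat_valid = loader.load_data(client_id="site-1")
print(f"train rows: {dmat_train.num_row()}, valid rows: {dmat_valid.num_row()}")
```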
