Decision Tree Ensemble Model

Overview

A C++ library implementing and benchmarking five supervised learning methods:

DecisionTree: a single CART decision tree implemented from scratch, supporting mean squared error (MSE) and mean absolute error (MAE) criteria, with optional OpenMP parallelism.
Bagging: bootstrap aggregation of multiple decision trees to reduce variance, with configurable number of trees and tree hyperparameters.
Boosting: a custom gradient boosting implementation that sequentially trains weak learners to minimize a specified loss, featuring early stopping and learning rate control.
LightGBM: integration of Microsoft’s LightGBM library for fast, histogram-based gradient boosting with support for large datasets.
AdvancedGBDT: a custom GBDT variant with DART-style dropout and flexible binning methods (quantile or frequency), for improved regularization and performance.

A Python benchmarking script and plotting utilities automate experiments and generate performance graphs.

Prerequisites

Common

A C++17‑capable compiler (Clang or GCC).
CMake 3.10 or higher.
Graphviz (for tree visualizations).
Python 3.8+ and pip.

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install cmake build-essential libomp-dev graphviz python3 python3-pip lightgbm

macOS (Homebrew)

brew update
brew install cmake libomp graphviz python3 lightgbm

Python

python3 -m pip install --user -r requirements.txt
# Plotting dependencies:
python3 -m pip install matplotlib pandas
# LightGBM bindings:
python3 -m pip install lightgbm

Environment Variables

OMP_NUM_THREADS: controls the number of threads for OpenMP-enabled models.
USE_MPI (a CMake option rather than an environment variable): enables MPI-based parallelism for Bagging; set it at configure time with -DUSE_MPI=ON.
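
When experiments are launched from a Python driver, the thread count can be set per run instead of exporting it in the shell. The following is a minimal sketch; omp_env and run_with_threads are hypothetical helper names, not part of this project.

```python
import os
import subprocess


def omp_env(num_threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)  # environment values are strings
    return env


def run_with_threads(cmd, num_threads):
    """Run cmd (a list of arguments) with the given OpenMP thread count."""
    return subprocess.run(cmd, env=omp_env(num_threads), check=True)
```

For example, run_with_threads(["./MainKFold"], 4) would start the executable with OMP_NUM_THREADS=4, leaving the parent shell's environment untouched.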

Building the C++ Project

mkdir build && cd build
# Configure with OpenMP enabled (pass -DOPENMP=OFF to disable it):
cmake -DOPENMP=ON ..
# Or, to build with MPI support (for Bagging), configure with:
cmake -DUSE_MPI=ON -DOPENMP=ON ..
make

This produces three executables in build/:

DataClean (CSV preprocessing)
MainEnsemble (single-run benchmarking)
MainKFold (k‑fold cross‑validation)

Usage Examples

Data Cleaning

./DataClean ../data/raw.csv ../data/clean.csv

Single Experiment Suite

./MainEnsemble [OPTIONS]

Launches an interactive menu for choosing one of the five methods. Hyperparameters can optionally be overridden with command-line flags, e.g.:

./MainEnsemble 2 --n_estimators=100 --max_depth=10 --use_omp=1

Common Command-Line Flags

--n_estimators=<int>: number of trees/estimators.
--max_depth=<int>: maximum tree depth.
--learning_rate=<float>: shrinkage rate for boosting.
--use_omp=<0|1>: disable (0) or enable (1) OpenMP parallelism.
--min_data_leaf=<int>: minimum samples per leaf for AdvancedGBDT.
--num_leaves=<int>: max leaf count for LightGBM.
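
Since every flag above follows the same --name=value pattern, a batch script can assemble the argument list from a dictionary. This is only a sketch: build_command is a hypothetical helper, and it assumes the leading positional argument selects the method, as in the example invocation above.

```python
def build_command(method, overrides, executable="./MainEnsemble"):
    """Assemble an argument list such as ['./MainEnsemble', '2', '--max_depth=10']."""
    args = [executable, str(method)]
    for name, value in overrides.items():
        # Boolean switches are written as 0/1, matching flags like --use_omp=1.
        if isinstance(value, bool):
            value = int(value)
        args.append(f"--{name}={value}")
    return args


cmd = build_command(2, {"n_estimators": 100, "max_depth": 10, "use_omp": True})
print(" ".join(cmd))
# prints: ./MainEnsemble 2 --n_estimators=100 --max_depth=10 --use_omp=1
```

Passing the resulting list to subprocess.run avoids shell quoting issues.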

Hyperparameter Details

DecisionTree

--max_depth=<int> (default: 60): Maximum depth of the tree.
--min_samples_split=<int> (default: 2): Minimum number of samples required to split an internal node.
--min_impurity_decrease=<float> (default: 1e-12): Minimum impurity decrease required to split a node.
--use_split_histogram=<0|1> (default: 0): Enable (1) or disable (0) histogram-based splitting.
--use_omp=<0|1> (default: 0): Enable (1) or disable (0) OpenMP parallelism.
--num_threads=<int> (default: 1): Number of threads to use when OpenMP is enabled.
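
To illustrate how min_samples_split and min_impurity_decrease interact, here is a small sketch of a split test under the MSE criterion. It assumes the impurity decrease is the parent's MSE minus the size-weighted MSE of the two children; the C++ implementation may define or weight it differently, and the function names are illustrative only.

```python
def mse(values):
    """Mean squared error of values around their mean (node impurity)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(left, right):
    """Parent MSE minus the size-weighted MSE of the two children."""
    parent = left + right
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(parent)
    return mse(parent) - weighted


def should_split(left, right, min_samples_split=2, min_impurity_decrease=1e-12):
    """Decide whether a candidate split of the targets is worth making."""
    if len(left) + len(right) < min_samples_split:
        return False  # node is too small to split
    if not left or not right:
        return False  # a split must leave samples on both sides
    return impurity_decrease(left, right) >= min_impurity_decrease


print(should_split([1.0, 1.0], [3.0, 3.0]))  # True: the split removes all variance
print(should_split([2.0, 2.0], [2.0, 2.0]))  # False: nothing is gained
```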

Bagging

--n_estimators=<int> (default: 20): Number of trees to aggregate.
--max_depth=<int> (default: 60): Maximum depth for each base tree.
--min_samples_split=<int> (default: 2): Minimum samples to split a node in base trees.
--min_impurity_decrease=<float> (default: 1e-6): Impurity threshold for splitting in base trees.
--which_loss_function=<0|1> (default: 0): Loss function for aggregation: 0=MSE, 1=MAE.
--use_split_histogram=<0|1> (default: 0): Enable histogram splitting in base trees.
--use_omp=<0|1> (default: 0): Enable OpenMP parallelism.
--num_threads=<int> (default: 1): Number of threads per tree when using OpenMP.
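
The idea behind bootstrap aggregation can be sketched in a few lines. This is not the library's code: the base learner is passed in as a callable, predictions are averaged (the natural choice for MSE), and fit_mean is a deliberately trivial stand-in for a decision tree.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def fit_bagging(X, y, fit_base, n_estimators=20, seed=0):
    """Train n_estimators base models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = bootstrap_indices(len(X), rng)
        models.append(fit_base([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, x):
    """Average the predictions of all base models."""
    return sum(m(x) for m in models) / len(models)


def fit_mean(X, y):
    """Placeholder base learner: always predicts the mean training target."""
    mean = sum(y) / len(y)
    return lambda x: mean
```

Because each base model is trained independently, the loop over estimators is the natural place for parallelism.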

Boosting (Custom)

--n_estimators=<int> (default: 75): Number of boosting iterations.
--learning_rate=<float> (default: 0.07): Shrinkage rate of each new tree.
--max_depth=<int> (default: 15): Maximum depth of each weak learner.
--min_samples_split=<int> (default: 3): Minimum samples to split nodes in weak learners.
--min_impurity_decrease=<float> (default: 1e-5): Impurity threshold for splitting in weak learners.
--which_loss_function=<0|1> (default: 0): Loss function: 0=MSE, 1=MAE.
--use_split_histogram=<0|1> (default: 1): Enable histogram-based splitting.
--use_omp=<0|1> (default: 0): Enable OpenMP parallelism.
--num_threads=<int> (default: 1): Threads per iteration when using OpenMP.
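
The role of n_estimators and learning_rate is easiest to see in a toy version of gradient boosting for squared error. The sketch below uses one-dimensional decision stumps as weak learners and stops early once the training loss no longer improves by more than tol; the stopping rule, the tol parameter, and all function names are assumptions made for illustration, not the library's implementation.

```python
def fit_stump(x, residuals):
    """Best single-threshold split (x <= t) on a 1-D feature under squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: fall back to the mean residual
        mean = sum(residuals) / len(residuals)
        return lambda xi: mean
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def fit_boosting(x, y, n_estimators=75, learning_rate=0.07, tol=1e-9):
    """Sequentially fit stumps to the residuals, shrinking each contribution."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    prev_loss = float("inf")
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        loss = sum(r * r for r in residuals) / len(y)
        if prev_loss - loss < tol:  # early stopping on the training loss
            break
        prev_loss = loss
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return base, learning_rate, stumps


def predict_boosting(model, xi):
    """Sum the shrunken stump outputs on top of the initial constant."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(s(xi) for s in stumps)
```

A smaller learning_rate shrinks each step, so it typically needs more iterations to reach the same training loss.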

LightGBM

--n_estimators=<int> (default: 100): Number of boosting rounds.
--learning_rate=<float> (default: 0.1): Learning rate (shrinkage).
--max_depth=<int> (default: -1): Maximum tree depth (-1 for no limit).
--num_leaves=<int> (default: 31): Maximum leaves per tree.
--subsample=<float> (default: 1.0): Fraction of data to use per iteration.
--colsample_bytree=<float> (default: 1.0): Fraction of features to use.
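
The flag names and defaults listed above coincide with parameters of LGBMRegressor in LightGBM's scikit-learn-style Python API, so the same configuration can be reproduced from Python for a sanity check. This is a sketch under that assumption; lightgbm_params and make_regressor are hypothetical helpers, and how the C++ executable forwards these values to LightGBM is not shown here.

```python
# Defaults as documented in the list above.
LIGHTGBM_DEFAULTS = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": -1,
    "num_leaves": 31,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}


def lightgbm_params(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(LIGHTGBM_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    return {**LIGHTGBM_DEFAULTS, **overrides}


def make_regressor(**overrides):
    """Build an LGBMRegressor; lightgbm is imported lazily so the helper
    above stays usable when the package is not installed."""
    import lightgbm as lgb
    return lgb.LGBMRegressor(**lightgbm_params(**overrides))
```

In the Python package, subsample only takes effect when subsample_freq is also set to a positive value.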

AdvancedGBDT

--n_estimators=<int> (default: 200): Number of trees.
--learning_rate=<float> (default: 0.01): Learning rate.
--max_depth=<int> (default: 50): Maximum depth per tree.
--min_data_leaf=<int> (default: 1): Minimum data per leaf.
--num_bins=<int> (default: 1024): Number of bins for feature histograms.
--use_dart=<0|1> (default: 1): Enable DART dropout technique.
--dropout_rate=<float> (default: 0.5): Dropout rate for DART.
--skip_drop_rate=<float> (default: 0.3): Skip-drop probability for DART.
--binning_method=<0|1> (default: 1): Binning method: 0=Quantile, 1=Frequency.
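
To make dropout_rate and skip_drop_rate concrete, the sketch below shows one way a DART iteration can pick the trees to drop and rescale the ensemble afterwards. It follows the normalization of the original DART formulation without a learning rate (new tree scaled by 1/(k+1), dropped trees by k/(k+1), where k is the number of dropped trees); AdvancedGBDT may differ in the details, and the function names are illustrative.

```python
import random


def select_dropped(n_trees, dropout_rate=0.5, skip_drop_rate=0.3, rng=random):
    """Return the indices of the trees to drop in this boosting iteration."""
    # With probability skip_drop_rate, skip dropout and behave like plain boosting.
    if n_trees == 0 or rng.random() < skip_drop_rate:
        return []
    # Otherwise each existing tree is dropped independently.
    return [i for i in range(n_trees) if rng.random() < dropout_rate]


def dart_scales(n_dropped):
    """Scale factors (new_tree, dropped_trees) applied after fitting the new tree."""
    k = n_dropped
    return 1.0 / (k + 1), k / (k + 1)


print(dart_scales(3))  # (0.25, 0.75)
```

The new tree is fitted against the predictions of the ensemble without the dropped trees, and the rescaling keeps the combined contribution of the new and dropped trees from overshooting the target.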

K‑Fold Cross‑Validation

./MainKFold

Select a method and number of folds to run cross‑validation.
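
For reference, k-fold cross-validation partitions the samples into k folds and uses each fold once as the test set while training on the rest. A minimal sketch of the index bookkeeping, assuming contiguous folds without shuffling (MainKFold may split differently):

```python
def kfold_indices(n_samples, n_folds):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    splits = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits


for train, test in kfold_indices(5, 2):
    print(train, test)
# prints:
# [3, 4] [0, 1, 2]
# [0, 1, 2] [3, 4]
```

The reported score is then the average of the per-fold errors.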

Python Benchmarking & Plotting

From the project root:

cd script
python3 benchmark.py   # runs experiments, writes CSV
cd ../plots
python3 plot.py        # reads CSV and generates figures

benchmark.py writes its results to script/benchmark_results_extended.csv.
plot.py reads the CSV and outputs figures into plots/figures/ as PNG files.

Project Structure

/decision_tree_ensemble_model
├─ CMakeLists.txt
├─ src/            # C++ source (models, utilities, pipelines)
├─ build/          # build artifacts
├─ script/         # Python script to run batch experiments and export CSV
├─ plots/          # Python script and output directory for generated figures
└─ README.md       # project overview and instructions

Usage Tips

Data Paths: Ensure that datasets are available in the correct path (../datasets/).
Graphviz Installation: If you encounter errors about missing dot, install Graphviz with:
```
sudo apt-get install graphviz
```
Note: If errors persist, make sure dot is in your system's PATH. You can verify this by running:
```
which dot
```
If it is not found, you may need to add Graphviz to your PATH or specify the full path to the dot binary.
Execution Errors: If errors like command not found occur, ensure you are running the executables from the build/ directory.
Permissions: If you encounter permission issues when running the executables, you may need to set the executable bit with:
```
chmod +x DataClean MainEnsemble MainKFold
```

Repository: NicoBeCodin/decision_tree_ensemble_model