Hi Jimmy,
I'm trying to illustrate GOSDT with the diabetes dataset located here, and it seems that the time limit is being ignored. I've tried with continuous features, as well as discretizing on my own, but I can't seem to get anything to return in the amount of time I would expect. I'm running the following code in a Jupyter notebook on a Debian Linux instance with 8 cores and 30GB RAM, so I wouldn't suspect a hardware issue (RAM particularly is hovering around ~1GB used). This example took ~23 minutes on my machine.
## --- env setup --- ##
import pandas as pd
from sklearn.model_selection import train_test_split
import sys
sys.path.append('../GeneralizedOptimalSparseDecisionTrees/python/') # location of cloned GOSDT repo
from model.gosdt import GOSDT
## --- load data (directly from Kaggle location) --- ##
diabetes = pd.read_csv('diabetes.csv')
## --- same training/test split I'm using --- ##
train, _ = train_test_split(diabetes, random_state=0, test_size=0.2)
X = train.drop(columns="Outcome")
y = train['Outcome']
## --- specify and fit model --- ##
hyperparams = {
    "regularization": 0.1,
    "time_limit": 10,
    "precision_limit": 0.1,
    "worker_limit": 8,
    "verbose": True
}
model = GOSDT(hyperparams)
model.fit(X, pd.DataFrame(y))
print(model.time / 60) # reports 23.861 minutes, despite time_limit being set to 10 seconds
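For reference, the manual discretization I mentioned was along these lines (a minimal sketch; the thresholds and the binarize helper are illustrative, not my exact code):
## --- manual discretization sketch (thresholds are illustrative) --- ##
def binarize(df, thresholds):
    # one binary column per (feature, threshold) pair, since GOSDT operates on binary features
    cols = {}
    for feature, cuts in thresholds.items():
        for t in cuts:
            cols[f"{feature}>={t}"] = (df[feature] >= t).astype(int)
    return pd.DataFrame(cols, index=df.index)

X_bin = binarize(X, {"Glucose": [100, 126, 144], "BMI": [25, 30]})
model.fit(X_bin, pd.DataFrame(y))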
A potentially separate issue is that I have only been able to get trees with a single split and two terminal nodes. This is perhaps due in part to the time-limit issue making anything below ~0.1 regularization impractical to run, but I wanted to see if you could offer any advice. Here is the tree I'm getting, pretty much regardless of which combination of hyperparameters I use (a sketch of the sweep I've tried follows the tree output below):
if 144 <= Glucose then:
    predicted class: 1
    misclassification penalty: 0.065
    complexity penalty: 0.1
else if Glucose < 144 then:
    predicted class: 0
    misclassification penalty: 0.191
    complexity penalty: 0.1
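For completeness, the hyperparameter combinations I've been trying look roughly like this (a sketch; the grid values are illustrative, not an exhaustive list of what I ran):
## --- hyperparameter sweep sketch (grid values are illustrative) --- ##
for reg in [0.1, 0.05, 0.01]:
    model = GOSDT({**hyperparams, "regularization": reg})
    model.fit(X, pd.DataFrame(y))
    print(reg, model.time / 60) # every run produces the same single-split tree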
Thank you!
-Mitch