
Commit 5735e00

allow grid search for classifiers/regressors params in ensemble methods (#259)
1 parent 00981a6 commit 5735e00

16 files changed: 266 additions & 22 deletions

docs/sources/CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -18,6 +18,7 @@ The CHANGELOG for the current development version is available at
 - Added `evaluate.permutation_test`, a permutation test for hypothesis testing (or A/B testing) to test whether two samples come from the same distribution; in other words, a procedure to test the null hypothesis that two groups are not significantly different (e.g., a treatment and a control group).
 - Added `'leverage'` and `'conviction'` as evaluation metrics to the `frequent_patterns.association_rules` function. [#246](https://github.com/rasbt/mlxtend/pull/246) & [#247](https://github.com/rasbt/mlxtend/pull/247)
 - Added a `loadings_` attribute to `PrincipalComponentAnalysis` to compute the factor loadings of the features on the principal components. [#251](https://github.com/rasbt/mlxtend/pull/251)
+- Allow grid search over classifiers/regressors in ensemble and stacking estimators [#259](https://github.com/rasbt/mlxtend/pull/259)

 ##### Changes

docs/sources/user_guide/classifier/EnsembleVoteClassifier.ipynb

Lines changed: 14 additions & 0 deletions

@@ -459,6 +459,20 @@
     "grid = grid.fit(iris.data, iris.target)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note**\n",
+    "\n",
+    "The `EnsembleVoteClassifier` also enables grid search over the `clfs` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both different classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
+    "\n",
+    "    params = {'randomforestclassifier__n_estimators': [1, 100],\n",
+    "              'clfs': [(clf1, clf1, clf1), (clf2, clf3)]}\n",
+    "\n",
+    "it will use the instance settings of `clf1`, `clf2`, and `clf3` and not overwrite them with the `'n_estimators'` settings from `'randomforestclassifier__n_estimators': [1, 100]`."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

docs/sources/user_guide/classifier/StackingCVClassifier.ipynb

Lines changed: 14 additions & 0 deletions

@@ -423,6 +423,20 @@
     "print('Accuracy: %.2f' % grid.best_score_)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note**\n",
+    "\n",
+    "The `StackingCVClassifier` also enables grid search over the `classifiers` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both different classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
+    "\n",
+    "    params = {'randomforestclassifier__n_estimators': [1, 100],\n",
+    "              'classifiers': [(clf1, clf1, clf1), (clf2, clf3)]}\n",
+    "\n",
+    "it will use the instance settings of `clf1`, `clf2`, and `clf3` and not overwrite them with the `'n_estimators'` settings from `'randomforestclassifier__n_estimators': [1, 100]`."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

docs/sources/user_guide/classifier/StackingClassifier.ipynb

Lines changed: 14 additions & 0 deletions

@@ -400,6 +400,20 @@
     "print('Accuracy: %.2f' % grid.best_score_)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note**\n",
+    "\n",
+    "The `StackingClassifier` also enables grid search over the `classifiers` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both different classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
+    "\n",
+    "    params = {'randomforestclassifier__n_estimators': [1, 100],\n",
+    "              'classifiers': [(clf1, clf1, clf1), (clf2, clf3)]}\n",
+    "\n",
+    "it will use the instance settings of `clf1`, `clf2`, and `clf3` and not overwrite them with the `'n_estimators'` settings from `'randomforestclassifier__n_estimators': [1, 100]`."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

docs/sources/user_guide/regressor/StackingCVRegressor.ipynb

Lines changed: 14 additions & 0 deletions

@@ -278,6 +278,20 @@
     "print('Accuracy: %.2f' % grid.best_score_)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note**\n",
+    "\n",
+    "The `StackingCVRegressor` also enables grid search over the `regressors` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both different regressors and regressor parameters at the same time. For instance, while the following parameter dictionary works\n",
+    "\n",
+    "    params = {'randomforestregressor__n_estimators': [1, 100],\n",
+    "              'regressors': [(regr1, regr1, regr1), (regr2, regr3)]}\n",
+    "\n",
+    "it will use the instance settings of `regr1`, `regr2`, and `regr3` and not overwrite them with the `'n_estimators'` settings from `'randomforestregressor__n_estimators': [1, 100]`."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},

docs/sources/user_guide/regressor/StackingRegressor.ipynb

Lines changed: 11 additions & 2 deletions

@@ -77,7 +77,9 @@
  {
   "cell_type": "code",
   "execution_count": 2,
-  "metadata": {},
+  "metadata": {
+   "collapsed": true
+  },
   "outputs": [],
   "source": [
    "from mlxtend.regressor import StackingRegressor\n",
@@ -604,7 +606,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "In case we are planning to use a regression algorithm multiple times, all we need to do is to add an additional number suffix in the parameter grid as shown below:"
+   "**Note**\n",
+   "\n",
+   "The `StackingRegressor` also enables grid search over the `regressors` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both different regressors and regressor parameters at the same time. For instance, while the following parameter dictionary works\n",
+   "\n",
+   "    params = {'randomforestregressor__n_estimators': [1, 100],\n",
+   "              'regressors': [(regr1, regr1, regr1), (regr2, regr3)]}\n",
+   "\n",
+   "it will use the instance settings of `regr1`, `regr2`, and `regr3` and not overwrite them with the `'n_estimators'` settings from `'randomforestregressor__n_estimators': [1, 100]`."
   ]
  },
 {

mlxtend/classifier/ensemble_vote.py

Lines changed: 1 addition & 4 deletions

@@ -255,10 +255,7 @@ def get_params(self, deep=True):

         for key, value in six.iteritems(super(EnsembleVoteClassifier,
                                               self).get_params(deep=False)):
-            if key == 'clfs':
-                continue
-            else:
-                out['%s' % key] = value
+            out['%s' % key] = value
         return out

     def _predict(self, X):

mlxtend/classifier/stacking_classification.py

Lines changed: 1 addition & 4 deletions

@@ -141,10 +141,7 @@ def get_params(self, deep=True):

         for key, value in six.iteritems(super(StackingClassifier,
                                               self).get_params(deep=False)):
-            if key in ('classifiers', 'meta-classifier'):
-                continue
-            else:
-                out['%s' % key] = value
+            out['%s' % key] = value

         return out

mlxtend/classifier/stacking_cv_classification.py

Lines changed: 1 addition & 4 deletions

@@ -245,10 +245,7 @@ def get_params(self, deep=True):

         for key, value in six.iteritems(super(StackingCVClassifier,
                                               self).get_params(deep=False)):
-            if key in ('classifiers', 'meta-classifier'):
-                continue
-            else:
-                out['%s' % key] = value
+            out['%s' % key] = value

         return out

mlxtend/classifier/tests/test_ensemble_vote_classifier.py

Lines changed: 36 additions & 0 deletions

@@ -8,6 +8,7 @@
 from sklearn.linear_model import LogisticRegression
 from sklearn.naive_bayes import GaussianNB
 from sklearn.ensemble import RandomForestClassifier
+from sklearn.neighbors import KNeighborsClassifier
 import numpy as np
 from sklearn import datasets
 from sklearn.model_selection import GridSearchCV
@@ -86,3 +87,38 @@ def test_EnsembleVoteClassifier_gridsearch_enumerate_names():

     grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
     grid = grid.fit(iris.data, iris.target)
+
+
+def test_get_params():
+    clf1 = KNeighborsClassifier(n_neighbors=1)
+    clf2 = RandomForestClassifier(random_state=1)
+    clf3 = GaussianNB()
+    eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3])
+
+    got = sorted(list({s.split('__')[0] for s in eclf.get_params().keys()}))
+    expect = ['clfs',
+              'gaussiannb',
+              'kneighborsclassifier',
+              'randomforestclassifier',
+              'refit',
+              'verbose',
+              'voting',
+              'weights']
+    assert got == expect, got
+
+
+def test_classifier_gridsearch():
+    clf1 = KNeighborsClassifier(n_neighbors=1)
+    clf2 = RandomForestClassifier(random_state=1)
+    clf3 = GaussianNB()
+    eclf = EnsembleVoteClassifier(clfs=[clf1])
+
+    params = {'clfs': [[clf1, clf1, clf1], [clf2, clf3]]}
+
+    grid = GridSearchCV(estimator=eclf,
+                        param_grid=params,
+                        cv=5,
+                        refit=True)
+    grid.fit(X, y)
+
+    assert len(grid.best_params_['clfs']) == 2
