TargetEncoder in cuml.accel #7476
Conversation
```diff
-class TargetEncoder:
+class TargetEncoder(InteropMixin):
```
For everything to work properly you'll also need to move TargetEncoder to be a subclass of Base. Looking at the current implementation, I think this would entail:

- Adding `Base` to the base class list (it should go first, before the mixin)
- Updating the definition of `_get_param_names` to also include `super()._get_param_names()` (please also move this definition to the top, as we've done on other estimators)
- Ripping out the custom infra in the class like `_get_output_type`/`get_params`/... Basically everything that's not there to implement `fit`/`fit_transform`/`transform` should be moved to use the `Base` infra
- Adding `CumlArray` return type annotations to `transform`/`fit_transform` to enable method type reflection
- Possibly using a `CumlArrayDescriptor` to reflect fitted attributes, though from looking at the list in the sklearn docs I don't think that's necessary
- Ensuring we have adequate test coverage for this estimator so we're not unexpectedly breaking things. Since this wasn't a `Base` subclass and wasn't doing type reflection the way we do elsewhere, I wouldn't be surprised if after this we see differences in behavior. But if we're moving towards our expected standard, I'd view those "breaking changes" as more bugfixes, since this estimator doesn't follow our conventions.
Overall it looks like there's a bunch of cleanup work to do in this estimator; making this ready for cuml.accel is not necessarily a light lift.
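For concreteness, here is a minimal sketch of the target shape described in the list above. The import paths follow the usual `cuml.internals` layout but are assumptions, not taken from this PR, and the method bodies are elided:

```python
# Sketch only: illustrates the Base + InteropMixin restructuring
# suggested above; import paths and parameter list are assumptions.
from cuml.internals.base import Base
from cuml.internals.array import CumlArray
from cuml.internals.interop import InteropMixin


class TargetEncoder(Base, InteropMixin):
    @classmethod
    def _get_param_names(cls):
        # Defined at the top of the class, and extended with Base's
        # own parameter names as done on other estimators.
        return super()._get_param_names() + [
            "n_folds",
            "smooth",
            "seed",
            "split_method",
            "stat",
        ]

    # CumlArray return annotations enable method type reflection,
    # replacing the custom _get_output_type/get_params infrastructure.
    def fit_transform(self, X, y) -> CumlArray:
        ...

    def transform(self, X) -> CumlArray:
        ...
```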
| "target_mean_": to_cpu(self.mean), | ||
| "n_features_in_": len(self.train.columns) - 3, | ||
| **super()._attrs_to_cpu(model), | ||
| } |
`_attrs_to_cpu` and `_attrs_from_cpu` should really roundtrip all fitted state. Right now it looks like you're pulling from `categories_` in `_attrs_from_cpu` but not setting it in `_attrs_to_cpu`, which is a bug.
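For illustration, a hedged sketch of the symmetric version; only `self.mean` and `self.train` appear in the PR diff, so the GPU-side source of the categories (`self._fitted_categories` here) and the `to_cpu` import path are hypothetical:

```python
from cuml.internals.interop import InteropMixin, to_cpu  # path illustrative


class TargetEncoder(InteropMixin):
    def _attrs_to_cpu(self, model):
        # Emit every attribute that _attrs_from_cpu consumes, so fitted
        # state roundtrips GPU -> CPU -> GPU without loss. categories_
        # is the piece missing from the diff above.
        return {
            "categories_": [to_cpu(cats) for cats in self._fitted_categories],
            "target_mean_": to_cpu(self.mean),
            "n_features_in_": len(self.train.columns) - 3,
            **super()._attrs_to_cpu(model),
        }
```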
Resolves #7154
This PR adds support for TargetEncoder in cuml.accel. This feature was originally requested by the Kaggle team.

There are API differences between cuml's and sklearn's implementations of TargetEncoder, and these differences must be handled when translating between CPU and GPU models. The parameter mapping is summarized below:
| cuml parameter | sklearn parameter | Translation |
| --- | --- | --- |
| `n_folds` | `cv` | |
| `seed` | `random_state` | if `random_state` is `None`, defaults to `42` |
| `smooth` | `smooth` | if `smooth == "auto"`, set to `1.0`; else `float(model.smooth)` |
| `split_method` | `shuffle` | `"random"` if `shuffle=True`, otherwise `"continuous"` |
| `output_type` | — | fixed to `"auto"` |
| `stat` | — | fixed to `"mean"` |
| — | `categories` | fixed to `"auto"` |
| — | `target_type` | fixed to `"continuous"` |
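As a worked example of the table above, a hedged sketch of the CPU-to-GPU parameter translation. It is written as a free function for readability; the PR presumably implements this as an interop hook on the class, and that hook's name and signature are not shown here:

```python
def params_from_cpu(model):
    """Translate sklearn TargetEncoder params to cuml equivalents.

    Illustrative sketch of the mapping in the table above, not the
    actual hook from this PR.
    """
    return {
        "n_folds": model.cv,
        # sklearn allows random_state=None; cuml needs a concrete seed
        "seed": 42 if model.random_state is None else model.random_state,
        # sklearn's smooth="auto" maps to a concrete smoothing value
        "smooth": 1.0 if model.smooth == "auto" else float(model.smooth),
        # cuml's split_method stands in for sklearn's shuffle flag
        "split_method": "random" if model.shuffle else "continuous",
        # cuml-only parameters pinned to fixed values
        "output_type": "auto",
        "stat": "mean",
    }
```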
Testing upstream

```sh
./run-tests.sh -k "targetencoder"
```

Current failures
Testing local
```sh
pytest test_sklearn_import_export.py -k "target_encoder"
```

Current failures