[BUG] TargetEncoding with multiple target columns switches the target columns

**Describe the bug**
When the `TargetEncoding` op is used with multiple target columns, it may switch the contents of the target columns.
Furthermore, the internal statistics (count, sum) saved with the NVT workflow for the target columns are also switched.

**Steps/Code to reproduce bug**
- Download the original Sharechat dataset from the [RecSys Challenge'23 website](https://sharechat.com/recsys2023) or from this [direct link](https://drive.google.com/file/d/1Ayb8T308B5ehRJrYGddPzmezFuL1vYUz/view?usp=sharing)
- Start a container from the Merlin TF image
```bash
docker run --runtime=nvidia --rm -it --ipc=host --cap-add SYS_NICE -v /mnt/nvme0n1/datasets/recsyschallenge2023_shrcht/dataset/raw:/data -v /mnt/nvme0n1/datasets/recsyschallenge2023_shrcht/outputs:/outputs -p 8888:8888 nvcr.io/nvidia/merlin/merlin-tensorflow:23.05 /bin/bash
```
- Go to the Quick-start for ranking example folder: `cd /Merlin/examples/quick_start/scripts/preproc`
- Edit `preprocessing.py` to add a `TargetEncoding` op inside the `generate_nvt_workflow_features()` method, as shown below. Note that it includes multiple target columns: `["is_clicked", "is_installed"]`

```python
    def generate_nvt_workflow_features(self):
        ...
        outputs = reduce(lambda x, y: x + y, list(feats.values()))

        ###################### ADD THIS ######################
        target_encoding = (
            "f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,f_10,f_11,f_12,f_13,"
            "f_14,f_15,f_16,f_17,"
            "f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25,f_26,f_27,f_28"
            ",f_29,f_30,f_31,f_32".split(",")
            >> nvt.ops.TargetEncoding(
                ["is_clicked", "is_installed"],
                kfold=5,
                p_smooth=10,
                out_dtype="float32",
            )
        )
        
        outputs = outputs + target_encoding
        ######################
        
        workflow = nvt.Workflow(outputs, client=self.dask_cluster_client)
```
- Run the preprocessing script
```bash
cd /quick_start/scripts/preproc/
OUT_DATASET_PATH=/outputs/
python preprocessing_shrcht.py --input_data_format=tsv --csv_na_values="" --data_path="/data/train/*.csv" --output_path=$OUT_DATASET_PATH/shrcht_preproc_01_te/ --predict_data_path $OUT_DATASET_PATH/shrcht_preproc_01_te/predict --control_features="f_0" --categorical_features="f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,f_10,f_11,f_12,f_13,f_14,f_15,f_16,f_17,f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25,f_26,f_27,f_28,f_29,f_30,f_31,f_32" --continuous_features="f_33,f_34,f_35,f_36,f_37,f_38,f_39,f_40,f_41,f_42,f_43,f_44,f_45,f_46,f_47,f_48,f_49,f_50,f_51,f_52,f_53,f_54,f_55,f_56,f_57,f_58,f_59,f_60,f_61,f_62,f_63,f_64,f_65,f_66,f_67,f_68,f_69,f_70,f_71,f_72,f_73,f_74,f_75,f_76,f_77,f_78,f_79" --continuous_features_fillna="median" --binary_classif_targets="is_clicked,is_installed" --to_int8="is_clicked,is_installed"  
```
- Inspect the preprocessed output parquet files in `/outputs/shrcht_preproc_01_te/train`. You will notice that the values of the targets `is_clicked` and `is_installed` are now switched compared to the original raw data. Tip: you can use the column `f_0` as the primary key to match rows between the raw train dataset and the preprocessed dataset.
- Now inspect the NVT workflow statistics parquet file for target encoding, found at `/outputs/shrcht_preproc_01_te/workflow/categories/cat_stats.__fold___f_6.parquet`. That file contains the count and sum of each categorical value of `f_6` with respect to the targets. If you compute those statistics manually from the raw data (e.g. using something like `ddf.groupby('f_6')[["is_clicked", "is_installed"]].agg("sum")`), you will notice that the sums of the targets are switched compared to the raw data (i.e. the sum of positive `is_installed` events is higher than the sum of positive `is_clicked` events, which is typically not the real scenario).
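The manual statistics check described above can be sketched with plain pandas. This is a minimal illustration on made-up toy rows, not the Sharechat data; the helper name `target_sums` is hypothetical, and the column names inside the `cat_stats` parquet file are not assumed here, so the comparison against that file is left to the reader.

```python
import pandas as pd


def target_sums(df, cat_col, targets):
    """Sum each target column per category value, straight from the raw data."""
    return df.groupby(cat_col)[list(targets)].sum()


# Toy stand-in for the raw train data (values are invented for illustration).
raw = pd.DataFrame(
    {
        "f_6": [1, 1, 1, 2, 2],
        "is_clicked": [1, 1, 0, 1, 0],
        "is_installed": [1, 0, 0, 0, 0],
    }
)

expected = target_sums(raw, "f_6", ["is_clicked", "is_installed"])
print(expected)
```

If the per-category sums stored by the workflow show the `is_installed` totals where this table shows the `is_clicked` totals (and vice versa), the statistics have been switched.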

- Now edit `preprocessing.py` again and split the `TargetEncoding` op into two ops, one for each target, like this:
```python
    def generate_nvt_workflow_features(self):
        ...
        outputs = reduce(lambda x, y: x + y, list(feats.values()))

        ###################### ADD THIS ######################
        target_encoding_clicked = (
            "f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,f_10,f_11,f_12,f_13,"
            "f_14,f_15,f_16,f_17,"
            "f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25,f_26,f_27,f_28"
            ",f_29,f_30,f_31,f_32".split(",")
            >> nvt.ops.TargetEncoding(
                ["is_clicked"], kfold=5, p_smooth=10, out_dtype="float32",
            )
        )

        target_encoding_installed = (
            "f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,f_10,f_11,f_12,f_13,"
            "f_14,f_15,f_16,f_17,"
            "f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25,f_26,f_27,f_28"
            ",f_29,f_30,f_31,f_32".split(",")
            >> nvt.ops.TargetEncoding(
                ["is_installed"], kfold=5, p_smooth=10, out_dtype="float32",
            )
        )

        outputs = outputs + target_encoding_clicked + target_encoding_installed
        ######################

        workflow = nvt.Workflow(outputs, client=self.dask_cluster_client)
```
- If you check the generated files now, you'll see that the values of the target columns and the TE features are now correct
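A row-level check of the target columns, joining raw and preprocessed rows on `f_0` as suggested in the steps above, might look like the following. It is a sketch on invented toy frames; `compare_targets` is a hypothetical helper, and loading the real raw and parquet files into DataFrames is left out.

```python
import pandas as pd


def compare_targets(raw, preproc, key="f_0", targets=("is_clicked", "is_installed")):
    """Return 'unchanged', 'swapped', or 'mixed' for two target columns."""
    a, b = targets
    merged = raw[[key, a, b]].merge(
        preproc[[key, a, b]], on=key, suffixes=("_raw", "_pre")
    )
    # Rows where both targets kept their original values.
    same = (merged[f"{a}_raw"] == merged[f"{a}_pre"]) & (
        merged[f"{b}_raw"] == merged[f"{b}_pre"]
    )
    # Rows where each target holds the other target's original value.
    swapped = (merged[f"{a}_raw"] == merged[f"{b}_pre"]) & (
        merged[f"{b}_raw"] == merged[f"{a}_pre"]
    )
    if same.all():
        return "unchanged"
    if swapped.all():
        return "swapped"
    return "mixed"


# Toy raw rows (invented values); f_0 plays the role of the primary key.
raw = pd.DataFrame(
    {"f_0": [10, 11, 12], "is_clicked": [1, 1, 0], "is_installed": [0, 1, 0]}
)
# Toy "preprocessed" rows in a different order, with the two targets switched.
switched = pd.DataFrame(
    {"f_0": [12, 10, 11], "is_clicked": [0, 0, 1], "is_installed": [0, 1, 1]}
)

print(compare_targets(raw, raw))
print(compare_targets(raw, switched))
```

Rows where both targets are equal cannot distinguish the two cases, so the helper only reports `swapped` when every matched row is consistent with a switch and at least one row differs from the raw data.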


**Expected behavior**
`TargetEncoding` should not switch the values of the target columns or of the target-encoded features.

**Environment details (please complete the following information):**

- Environment location: Merlin TF container 22.05


