Conversation

@radekosmulski (Contributor)

A branch/PR for sharing code and facilitating discussion on adding the SIGIR dataset

@review-notebook-app (bot)

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@radekosmulski radekosmulski marked this pull request as draft May 19, 2023 10:02
@github-actions (bot)

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1102

@rnyak rnyak added this to the Merlin 23.06 milestone May 22, 2023
@rnyak rnyak requested review from bschifferer, marcromeyn and rnyak and removed request for marcromeyn May 22, 2023 15:38
@rnyak rnyak Jun 5, 2023

Line #4.    

I think we should not have two workflows. We need to merge these two CSV files first (you can do it with cudf or pandas up front, before NVT) and then apply Categorify and the other ops; otherwise the categories will be wrongly mapped.


@radekosmulski (Contributor, Author)

The way you describe it would be a good way to do it, but because of the high dimensionality of the pretrained embeddings (50 in this case, and customers may well encounter even larger sizes), our data would expand very significantly: we would need to join this information onto the 36+ million rows in the dataset.

This is only a WIP; I just wanted to see whether I could generate this dataset from a schema (I was mostly concerned about the list columns that contain the pretrained embedding information).

From what I understand, that functionality is still in the works, but what we might be able to do is provide the train set along with the pretrained embedding information separately to NVTabular/the dataloader, and the dataloader will be able to do the linking.

And sorry, I just reread your comment: yes, you are spot on, that is a great observation! If this information were provided separately to two workflows, the mapping after Categorify would indeed be broken. But I don't believe we currently have the machinery to support providing both at the same time to Categorify (unless we do the merge as you describe, which is probably not what we will want in the final example once the functionality is fully implemented).
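The up-front merge rnyak suggests can be sketched with pandas; the frames and column names below are illustrative stand-ins for the two SIGIR CSVs, not the actual files:

```python
import pandas as pd

# Illustrative stand-ins for the two SIGIR CSVs: an interaction log
# and a per-SKU metadata table (column names assumed, not verified).
browsing = pd.DataFrame({
    "session_id": [1, 1, 2],
    "product_sku_hash": ["a", "b", "a"],
})
sku_content = pd.DataFrame({
    "product_sku_hash": ["a", "b"],
    "category_hash": ["c1", "c2"],
})

# Merge BEFORE building any NVTabular workflow, so that a single
# Categorify later sees one consistent view of product_sku_hash and
# maps every occurrence of each value to the same integer id.
merged = browsing.merge(sku_content, on="product_sku_hash", how="left")
print(merged)
```

Fitting two separate workflows over the unmerged files would let Categorify assign the same SKU different ids in each output, which is the wrong mapping the comment warns about.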

@rnyak rnyak Jun 5, 2023

Line #3.    cd /workspace && pip install . 

why do we need to do pip install here?


@radekosmulski (Contributor, Author)

I am modifying merlin-models to include the schema for the new datasets (sigir-browsing and sigir-sku). To make them available in the notebook, I need to install the modified library (which inside the Docker container lives in /workspace). I am only doing this to help with working on the PR, but you are right, this should not exist in the final version.

@rnyak rnyak Jun 5, 2023

Line #1.    generate_data('sigir-sku', 1000).head()

In the original data, the file sku_to_content.csv also contains the category_hash feature. If you want to remove it, maybe you can add a note here?


@radekosmulski (Contributor, Author)

Yes, you are right, well spotted! I was not aware I was missing it here; thank you for pointing it out. I brought it back (added it to schema.json for the synthetic data and also added it to the workflow for processing the actual data), but I have removed the image_vector and added a note as you advise. I don't believe we will be using it; we will instead go with the description_vector, which lends itself much better to this example (the image_vector is of length greater than 68k in the original data).

@radekosmulski radekosmulski force-pushed the add_sigir_dataset branch 2 times, most recently from 252e07b to ad9fa2b Compare June 12, 2023 05:38
@rnyak rnyak Jun 13, 2023

read --> ready


@radekosmulski (Contributor, Author)

thank you, made the change!

@rnyak rnyak Jun 13, 2023

Line #1.    input_block(batch)

I see below that you are able to print input_block(inputs)? Or is input_block(batch) still not working?


@radekosmulski (Contributor, Author)

It is all good now! I think the batch here might have included the targets, but when I use mm.sample_batch this works.

@radekosmulski radekosmulski marked this pull request as ready for review June 19, 2023 00:15
@bschifferer bschifferer Jun 19, 2023

Can we make an if-else statement using a boolean variable to indicate whether we use synthetic or real data?


@radekosmulski (Contributor, Author)

Change made! Indeed, this is much more elegant with a variable than with commenting code in and out.
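The flag bschifferer asked for can be as simple as this sketch (the variable name and branch contents are illustrative, not the notebook's actual code):

```python
# A single boolean switches the notebook between synthetic and real
# data, instead of commenting cells in and out.
USE_SYNTHETIC = True

if USE_SYNTHETIC:
    # e.g. call generate_data('sigir-browsing', ...) in the real notebook
    data_source = "synthetic"
else:
    # e.g. read the downloaded SIGIR CSVs in the real notebook
    data_source = "real"

print(f"Using {data_source} data")
```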

@bschifferer bschifferer Jun 19, 2023

Maybe we need a new headline here to differentiate between downloading/preparing the data and the NVT workflow?


@radekosmulski (Contributor, Author)

Good point, I changed the structure and included additional headlines.

@bschifferer bschifferer Jun 19, 2023

I do not think we should use in-memory workflow outputs from NVTabular; this would not scale to a large dataset. I think we should store them to disk via to_parquet.

Users will adapt our examples to their own datasets, and it is likely they would run out of memory.


@radekosmulski (Contributor, Author)

change made

@bschifferer bschifferer Jun 19, 2023

Line #2.    wf = nvt.Workflow(out + 'description_vector')

Can you combine a list with a string?


@radekosmulski (Contributor, Author)

Good question! 🙂 I don't believe so; I just tried it a second ago, and Python does the sane thing (instead of converting the string to an array of chars or something like that) and throws an error.

Here, however, out is not a list but a <Node Categorify>, so it all works.
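The plain-Python half of this answer is easy to check; the notebook line works only because NVTabular's graph Node overloads + to accept additional column names, a behavior stated above rather than demonstrated here:

```python
# A plain Python list cannot be concatenated with a string:
try:
    ["category_hash"] + "description_vector"
    outcome = "no error"
except TypeError:
    outcome = "TypeError"
print(outcome)  # TypeError
```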

@bschifferer bschifferer Jun 19, 2023

Maybe we can reference our other examples?


@radekosmulski (Contributor, Author)

linked to our NVTabular examples!

@bschifferer bschifferer Jun 19, 2023

Line #14.        batch_size=10,

If you run it on the full dataset, batch_size=10 will take a long time?


@radekosmulski (Contributor, Author)

Yes, you are right! I changed this and added a few words around it so that people make good use of the hardware available to them.

@bschifferer bschifferer Jun 19, 2023

Can you display the input_batch?


@radekosmulski (Contributor, Author)

I tried this in a live notebook and it actually doesn't look that bad; the rows are hidden behind an ellipsis. I will make a commit with the batch showing and will make a change if need be.

@bschifferer bschifferer Jun 19, 2023

Which workflow do you deploy here?

It seems that you expected the Categorified input to go to Triton?


@radekosmulski (Contributor, Author)

Yes, good point! 🙂 Thanks for highlighting this; I am now working on a version that has this issue fixed.

@radekosmulski radekosmulski changed the title from "[WIP] add SIGIR nb" to "[add session-based example with pretrained embeddings" Jun 19, 2023
@radekosmulski radekosmulski changed the title from "[add session-based example with pretrained embeddings" to "add session-based example with pretrained embeddings" Jun 19, 2023
@bschifferer (Contributor)

rerun tests

@bschifferer (Contributor)

rerun tests

@bschifferer (Contributor)

rerun tests

@radekosmulski radekosmulski merged commit 7a0e221 into main Jun 19, 2023