Conversation

@radekosmulski (Contributor)

A branch/PR for sharing code and facilitating discussion on adding the SIGIR dataset

@review-notebook-app (bot)

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@radekosmulski radekosmulski marked this pull request as draft May 19, 2023 10:02
@github-actions (bot)

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1102

@rnyak rnyak added this to the Merlin 23.06 milestone May 22, 2023
@rnyak rnyak requested review from bschifferer, marcromeyn and rnyak and removed request for marcromeyn May 22, 2023 15:38
@rnyak rnyak Jun 5, 2023

Line #4.    

I think we should not have two workflows. We need to merge these two CSV files first (you can do it with cudf or pandas up front, before NVT) and then apply Categorify and the other ops; otherwise the categories will be wrongly mapped.


@radekosmulski (Contributor, Author)

The way you describe it would be a good way to do it, but because of the high dimensionality of the pretrained embeddings (50 in this case, and customers may well encounter even larger sizes), our data would expand very significantly: we would need to join this information onto the 36+ million rows in the dataset.

This is only a WIP; I just wanted to see whether I could generate this dataset from a schema (I was mostly concerned about the list columns that contain the pretrained embedding information).

From what I understand, that functionality is still in the works, but what we might be able to do is provide the train set along with the pretrained embedding information separately to NVTabular/the dataloader, and the dataloader will be able to do the linking.

And sorry, I just reread your comment: yes, you are spot on, that is a great observation! If this information were provided separately to two workflows, the mapping after Categorify would indeed be broken. But I don't believe we currently have the machinery to support providing both at the same time to Categorify (unless we do the merge as you describe, which is probably not what we will want in the final example once the functionality is fully implemented).
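The up-front merge rnyak suggests can be sketched with pandas; the frames and column names below are illustrative stand-ins for the two SIGIR CSVs, not the actual files:

```python
import pandas as pd

# Illustrative stand-ins for the two SIGIR CSVs: an interaction log
# and a per-SKU metadata table (column names assumed, not verified).
browsing = pd.DataFrame({
    "session_id": [1, 1, 2],
    "product_sku_hash": ["a", "b", "a"],
})
sku_content = pd.DataFrame({
    "product_sku_hash": ["a", "b"],
    "category_hash": ["c1", "c2"],
})

# Merge BEFORE building any NVTabular workflow, so that a single
# Categorify later sees one consistent view of product_sku_hash and
# maps every occurrence of each value to the same integer id.
merged = browsing.merge(sku_content, on="product_sku_hash", how="left")
print(merged)
```

Fitting two separate workflows over the unmerged files would let Categorify assign the same SKU different ids in each output, which is the wrong mapping the comment warns about.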

@rnyak rnyak Jun 5, 2023

Line #3.    cd /workspace && pip install . 

why do we need to do pip install here?


@radekosmulski (Contributor, Author)

I am modifying merlin-models to include the schema for the new datasets (sigir-browsing and sigir-sku). To make them available in the notebook, I need to install the modified library (which inside the Docker container lives in /workspace). I am only doing this to help with working on the PR, but you are right, this should not exist in the final version.

@rnyak rnyak Jun 5, 2023

Line #1.    generate_data('sigir-sku', 1000).head()

In the original data, the file sku_to_content.csv also contains the category_hash feature. If you want to remove it, maybe you can add a note here?


@radekosmulski (Contributor, Author)

Yes, you are right, well spotted! I was not aware I was missing it here; thank you for pointing it out. I brought it back (added it to schema.json for the synthetic data and also added it to the workflow for processing the actual data), but I have removed the image_vector and added a note as you advise. I don't believe we will be using it; we will instead go with the description_vector, which lends itself much better to this example (the image_vector is of length greater than 68k in the original data).

@radekosmulski radekosmulski force-pushed the add_sigir_dataset branch 2 times, most recently from 252e07b to ad9fa2b Compare June 12, 2023 05:38
@rnyak rnyak Jun 13, 2023

read --> ready


@radekosmulski (Contributor, Author)

thank you, made the change!

@rnyak rnyak Jun 13, 2023

Line #1.    input_block(batch)

I see below that you are able to print input_block(inputs)? Or is input_block(batch) still not working?


@radekosmulski (Contributor, Author)

It is all good now! I think the batch here might have included the targets, but when I use mm.sample_batch this works.

@radekosmulski radekosmulski marked this pull request as ready for review June 19, 2023 00:15
@bschifferer bschifferer Jun 19, 2023

Can we make an if-else statement using a boolean variable to indicate whether we use synthetic or real data?


@radekosmulski (Contributor, Author)

Change made! Indeed, this is much more elegant with a variable than with commenting code in and out.
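The flag bschifferer asked for can be as simple as this sketch (the variable name and branch contents are illustrative, not the notebook's actual code):

```python
# A single boolean switches the notebook between synthetic and real
# data, instead of commenting cells in and out.
USE_SYNTHETIC = True

if USE_SYNTHETIC:
    # e.g. call generate_data('sigir-browsing', ...) in the real notebook
    data_source = "synthetic"
else:
    # e.g. read the downloaded SIGIR CSVs in the real notebook
    data_source = "real"

print(f"Using {data_source} data")
```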

@bschifferer bschifferer Jun 19, 2023

Maybe we need a new headline here to differentiate between downloading/preparing the data and the NVT workflow?


@radekosmulski (Contributor, Author)

Good point, I changed the structure and included additional headlines.

@bschifferer bschifferer Jun 19, 2023

I do not think we should use in-memory workflow outputs from NVTabular; this would not scale to a large dataset. I think we should store them to disk via to_parquet.

Users will adapt our examples to their own datasets, and it is likely they would run out of memory.


@radekosmulski (Contributor, Author)

change made

@bschifferer bschifferer Jun 19, 2023

Line #2.    wf = nvt.Workflow(out + 'description_vector')

Can you combine a list with a string?


@radekosmulski (Contributor, Author)

Good question! 🙂 I don't believe so; I just tried it a second ago, and Python does the sane thing (instead of converting the string to an array of chars or something like that) and throws an error.

Here, however, out is not a list but a <Node Categorify>, so it all works.
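The plain-Python half of this answer is easy to check; the notebook line works only because NVTabular's graph Node overloads + to accept additional column names, a behavior stated above rather than demonstrated here:

```python
# A plain Python list cannot be concatenated with a string:
try:
    ["category_hash"] + "description_vector"
    outcome = "no error"
except TypeError:
    outcome = "TypeError"
print(outcome)  # TypeError
```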

@bschifferer bschifferer Jun 19, 2023

Maybe we can reference our other examples?


@radekosmulski (Contributor, Author)

linked to our NVTabular examples!

@bschifferer bschifferer Jun 19, 2023

Line #14.        batch_size=10,

If you run it on the full dataset, batch_size=10 will take a long time?


@radekosmulski (Contributor, Author)

Yes, you are right! I changed this and added a few words around it so that people make good use of the hardware available to them.

@bschifferer bschifferer Jun 19, 2023

Can you display the input_batch?


@radekosmulski (Contributor, Author)

I tried this in a live notebook and it actually doesn't look that bad; the rows are hidden behind an ellipsis. I will make a commit with the batch showing and will make a change if need be.

@bschifferer bschifferer Jun 19, 2023

Which workflow do you deploy here?

It seems that you expected the Categorified input to go to Triton?


@radekosmulski (Contributor, Author)

Yes, good point! 🙂 Thanks for highlighting this; I am now working on a version that has this issue fixed.

@radekosmulski radekosmulski changed the title from "[WIP] add SIGIR nb" to "[add session-based example with pretrained embeddings" Jun 19, 2023
@radekosmulski radekosmulski changed the title from "[add session-based example with pretrained embeddings" to "add session-based example with pretrained embeddings" Jun 19, 2023
@bschifferer (Contributor)

rerun tests

@bschifferer (Contributor)

rerun tests

@bschifferer (Contributor)

rerun tests

@radekosmulski radekosmulski merged commit 7a0e221 into main Jun 19, 2023