add session-based example with pretrained embeddings #1102
Conversation
Force-pushed from fb6fc99 to 1e01e26
Line #4.
I think we should not have two workflows. We need to merge these two CSV files first (you can do it with cudf or pandas upfront, before NVT) and then apply Categorify and the other ops; otherwise the mapping will be wrong.
The way you describe it would be a good way to do it, but because of the size of the pretrained embeddings (50 in this case, and customers will likely encounter even larger ones) our data would expand very significantly: we would need to attach this information to the 36+ million rows in the dataset.
This is only a WIP; I just wanted to see whether I could generate this dataset from a schema (I was mostly concerned about the list columns that contain the pretrained embedding information).
From what I understand that functionality is still in the works, but what we might be able to do is provide the train set and the pretrained-embedding information separately to NVTabular/the dataloader, and the dataloader will do the linking.
And sorry, I just reread your comment. Yes, you are spot on, that is a great observation: if this information were provided separately to two workflows, the mapping after Categorify would indeed be broken. But I don't believe we currently have the machinery to provide both at the same time to Categorify (unless we do the merge as you describe, which is probably not what we will want in the final example once the functionality is fully implemented).
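For reference, a minimal sketch of the merge-first approach suggested above (the file and column names follow the SIGIR CSVs but are assumptions here, and this is not necessarily what the final example will do):

```python
# Sketch: merge the item metadata onto the interactions up front so that a
# single workflow (and a single Categorify) sees the combined table.
import cudf
import nvtabular as nvt
from nvtabular import ops

browsing = cudf.read_csv("browsing_train.csv")   # interaction rows
sku = cudf.read_csv("sku_to_content.csv")        # item metadata incl. description_vector

# Left-join the (wide) pretrained embeddings onto every interaction row.
merged = browsing.merge(sku, on="product_sku_hash", how="left")

cats = ["session_id_hash", "product_sku_hash", "category_hash"] >> ops.Categorify()
workflow = nvt.Workflow(cats + "description_vector")
transformed = workflow.fit_transform(nvt.Dataset(merged))
```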
I am modifying merlin-models to include the schema for the new datasets (sigir-browsing and sigir-sku). To make them available in the notebook, I need to install the modified library (which, inside the Docker container, lives in /workspace). I am only doing this to help with working on the PR, but you are right, this should not exist in the final version.
Line #1. generate_data('sigir-sku', 1000).head()
In the original data, the file sku_to_content.csv also contains the category_hash feature. If you want to remove it, maybe you can add a note here?
Yes, you are right, well spotted! I was not aware I was missing it here, thank you for pointing it out. I brought it back (added it to schema.json for the synthetic data and also added it to the workflow for processing the actual data), but I have removed the image_vector and added a note as you advise. I don't believe we will be using it; we will rather go with the description_vector, which lends itself much better to this example (the image_vector is of length greater than 68k in the original data).
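For clarity, a tiny sketch of that change to the metadata preparation (column names follow the original SIGIR file and are assumptions here):

```python
import pandas as pd

sku = pd.read_csv("sku_to_content.csv")
# Keep category_hash and description_vector; drop the very wide image_vector
# (length > 68k in the original data), which this example does not use.
sku = sku.drop(columns=["image_vector"])
```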
Force-pushed from 252e07b to ad9fa2b
Force-pushed from ad9fa2b to 6f45905
Thank you, made the change!
Line #1. input_block(batch)
I see below that you are able to print input_block(inputs)? Or is input_block(batch) still not working?
It is all good now! I think the batch here might have included the targets; when I use mm.sample_batch this works.
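For illustration, a sketch of that pattern (assuming mm.sample_batch is available as in the merlin-models examples; train and input_block come from earlier in the notebook):

```python
import merlin.models.tf as mm

# Pull a small batch without targets so it can be fed straight to the input block.
batch = mm.sample_batch(train, batch_size=32, include_targets=False)
input_block(batch)
```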
Force-pushed from b607011 to 045b9a8
Force-pushed from 045b9a8 to c41aa0d
Can we make an if-else statement using a boolean variable to indicate if we use synthetic or real data?
Change made! Indeed, this is much more elegant with a variable than with commenting code in and out.
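Something along these lines, as a sketch (the dataset name comes from this PR, while the CSV path is a placeholder):

```python
import nvtabular as nvt
from merlin.datasets.synthetic import generate_data

USE_SYNTHETIC = True  # flip to False to run on the real SIGIR data

if USE_SYNTHETIC:
    train = generate_data("sigir-browsing", 1000)
else:
    train = nvt.Dataset("browsing_train.csv")
```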
Maybe we need a new headline here to differentiate between downloading and preparing the data and the NVT workflow?
Good point -- changed the structure and included additional headlines.
I do not think we should use workflow outputs from NVTabular directly; this would not scale to a large dataset. I think we should store them to disk via to_parquet.
Users will adapt our examples to their datasets, and it is likely they would run out of memory.
Change made!
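For reference, a sketch of the to_parquet approach (column names, the input CSV, and the output directory are placeholders):

```python
import nvtabular as nvt
from nvtabular import ops

cats = ["session_id_hash", "product_sku_hash"] >> ops.Categorify()
workflow = nvt.Workflow(cats)

# Write the transformed data to disk instead of holding the workflow output in memory.
workflow.fit_transform(nvt.Dataset("browsing_train.csv")).to_parquet("train_processed/")
```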
Line #2. wf = nvt.Workflow(out + 'description_vector')
Can you combine a list with a string?
Good question! 🙂 I don't believe so; I just tried it a second ago and Python does the sane thing (instead of converting the string to an array of chars or something like that) and throws an error.
Here, however, out is not a list but a <Node Categorify>, so it all works.
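A small illustration of that distinction (column names are placeholders):

```python
import nvtabular as nvt
from nvtabular import ops

# `out` is an NVTabular node (the output of Categorify), not a Python list,
# so adding a plain column name appends that column to the workflow output.
out = ["session_id_hash", "product_sku_hash"] >> ops.Categorify()
wf = nvt.Workflow(out + "description_vector")

# A plain list + string, by contrast, raises a TypeError:
# ["a", "b"] + "c"
```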
linked to our NVTabular examples!
Line #14. batch_size=10,
If you run it on the full dataset, batch_size=10 will take a long time?
Yes, you are right! I changed this and added a couple of words around it so that people make good use of the hardware available to them.
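For example, something along these lines (the exact value is a placeholder; model and train come from earlier in the notebook):

```python
BATCH_SIZE = 4096  # pick a value that makes good use of the available GPU memory

model.fit(train, batch_size=BATCH_SIZE, epochs=1)
```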
I tried this in a live notebook and it actually doesn't look that bad; rows get hidden behind "...". I will make a commit with the batch showing and will make a change if need be.
Which workflow do you deploy here?
It seems that you expect the categorified input to go to Triton?
Yes, good point! 🙂 Thanks for highlighting this! I am now working on a version that has this issue fixed.
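For context, a sketch of serving the raw-input workflow together with the trained model as a Triton ensemble, so the Categorify mapping is applied server-side (class and module names follow the merlin-systems examples and are assumptions here; workflow and model come from the notebook):

```python
from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.workflow import TransformWorkflow

# Raw request columns -> NVTabular transforms (incl. Categorify) -> model predictions.
serving_ops = (
    workflow.input_schema.column_names
    >> TransformWorkflow(workflow)
    >> PredictTensorflow(model)
)
ensemble = Ensemble(serving_ops, workflow.input_schema)
ensemble.export("/workspace/ensemble")  # directory Triton will load the models from
```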
A branch/PR for sharing code and facilitating discussion on adding the SIGIR dataset.