Benchmark framework for torchrec #3072

lizhouyu · 2025-06-09T21:30:18Z

Summary:
Benchmark framework for MPZCH

Rollback Plan:

Differential Revision: D76150895

facebook-github-bot · 2025-06-09T21:30:25Z

This pull request was exported from Phabricator. Differential Revision: D76150895

Summary:
Pull Request resolved: pytorch#3017

### Major changes
- Copy the following files from `fb` to the corresponding locations in the `torchrec` repository:
  - `fb/distributed/hash_mc_embedding.py → torchrec/distributed/hash_mc_embedding.py`
  - `fb/modules/hash_mc_evictions.py → torchrec/modules/hash_mc_evictions.py`
  - `fb/modules/hash_mc_metrics.py → torchrec/modules/hash_mc_metrics.py`
  - `fb/modules/hash_mc_modules.py → torchrec/modules/hash_mc_modules.py`
  - `fb/modules/tests/test_hash_mc_evictions.py → torchrec/modules/tests/test_hash_mc_evictions.py`
  - `fb/modules/tests/test_hash_mc_modules.py → torchrec/modules/tests/test_hash_mc_modules.py`
- Create a `test_hash_zch_mc.py` file in the `torchrec/distributed/tests` folder, following `test_quant_mc_embedding.py` in `torchrec/fb/distributed/tests`.
  - Trim the quantization and inference code, keeping only the training part.
  - Rewire the related packages from `torchrec.fb` to `torchrec`.
- Update `BUCK` files in related folders.
- Update the affected repos to use `torchrec` modules instead of the modules in `torchrec.fb`.
- Update `/modules/hash_mc_metrics.py`:
  - Replace the tensorboard module with a local file logger in the `hash_mc_metrics.py` module.
- Update the license declaration headers for the four OSS files.

### ToDos after landing this Diff
- Clean up the duplicated `hash_mc_modules.py` file in the `fb` folder for safe landing.

Differential Revision: D76476676

Summary:

### Major changes
- Create a `mpzch` folder under the `torchrec/github/examples` folder
- Implement a simple SparseArch module with a flag to switch between original and MPZCH managed collision modules
- Profile the running time and QPS for model training (GPU) / inference (CPU)
- Create a notebook tutorial for ZCH basics and the use of ZCH modules in TorchRec

### ToDos for OSS
- When the internal torchrec MPZCH module is OSS
  - Remove the `BUCK` file
  - Replace all the `from torchrec.fb.modules` in `sparse_arch.py` with `from torchrec.modules`

### Potential improvement
- Add hash collision counter
- Show profiling results in the Readme file
- Add multi-batch profiling

Differential Revision: D75570684

Reviewed By: aporialiao
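The QPS figure mentioned in the summary above can be read as samples processed per second of wall-clock time. The following is a minimal sketch of that idea, not the profiling code from this diff: the helper name `measure_qps` is hypothetical, and a plain callable stands in for a real training or inference step.

```python
import time
from typing import Callable, Sequence


def measure_qps(
    step: Callable[[Sequence[int]], object],
    batches: Sequence[Sequence[int]],
) -> float:
    """Return samples processed per second of wall-clock time.

    `step` is a hypothetical stand-in for one training/inference step.
    """
    num_samples = 0
    start = time.perf_counter()
    for batch in batches:
        step(batch)                # run one step on this batch
        num_samples += len(batch)  # count samples, not batches
    elapsed = time.perf_counter() - start
    return num_samples / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Toy workload: ten batches of three samples each.
    print(measure_qps(lambda batch: sum(batch), [[1, 2, 3]] * 10))
```

When the step launches CUDA kernels, the work is asynchronous, so a real GPU measurement would need a `torch.cuda.synchronize()` call before reading the timer; the sketch leaves that out to stay dependency-free.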

facebook-github-bot · 2025-06-16T19:05:27Z

This pull request was exported from Phabricator. Differential Revision: D76150895

facebook-github-bot · 2025-06-16T20:56:09Z

This pull request was exported from Phabricator. Differential Revision: D76150895

Summary:
Pull Request resolved: pytorch#3072

# Benchmark framework for MPZCH

### Major changes
- Add a `benchmark prober` in `torchrec/distributed/benchmark/benchmark_zch_utils.py` to collect and calculate zero collision hash (ZCH) metrics such as hit count, insert count, and collision count.
- Implement a `benchmark_zch_dlrmv2` local testbed in `torchrec/distributed/benchmark/benchmark_zch_dlrmv2.py`, which profiles a DLRMv2 model with and without MPZCH enabled and records ZCH-related metrics, QPS, NE, and AUROC.
- Add `mc_adapter` modules in `torchrec/modules/mc_adapter.py`. These modules enable seamless replacement of embedding collection and embedding bag collection modules with their managed collision versions.
- Add two dictionaries, `self.table_name_on_device_remapped_ids_dict` and `self.table_name_on_device_input_ids_dict`, to the `HashZchManagedCollisionModule` module in `torchrec/modules/hash_mc_modules.py`. After input mapping, they record, respectively, the remapped identities and the input feature values passed to the MPZCH module on the current rank.
- Add a `count_non_zch_collision.py` script to count the collision rate of non-ZCH modules after applying `murmur_hash3`.
- Add the Criteo Kaggle dataset data loader in `torchrec/distributed/benchmark/data`, and revise the `hashes` attribute of the data pipeline in the `_get_in_memory_dataloader` function of `torchrec/distributed/benchmark/data/dlrm_dataloader.py` to pre-hash the input feature values to the passed-in argument `input_hash_size` (default: 100000).
- Note that `single_ttl` in the `HashZchSingleTtlScorer` module of `torchrec/modules/hash_mc_evictions.py` can be changed to adjust the evictability of identities in each `HashZchManagedCollisionModule` module. By default, identities become evictable after one hour, whereas the existing benchmark workflow takes only several minutes on the subset of the Criteo Kaggle dataset. This discrepancy means that no eviction occurs during the profiling process.

### Dataset
- [Criteo Kaggle Small](https://drive.google.com/file/d/1__rPcUSa45FHkmnBwivuM7K4nMYWD7b7/view?usp=sharing)
- [Criteo Kaggle](https://drive.google.com/file/d/1_lAbXTEOk5vlPGXd4UvTrxGV6sCPer_R/view?usp=drive_link)

Differential Revision: D76150895
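As a rough illustration of what a collision rate means for a non-ZCH table, the sketch below hashes raw IDs into a fixed number of slots and counts the IDs that land in a slot already claimed by a different ID. It is not the code in `count_non_zch_collision.py`: the names `slot_for` and `collision_rate` are hypothetical, and `hashlib.blake2b` stands in for `murmur_hash3` so that the example needs only the standard library.

```python
import hashlib
from typing import Dict, Iterable


def slot_for(raw_id: int, num_slots: int) -> int:
    """Map a raw ID to a table slot.

    blake2b is a stand-in for murmur_hash3 (an assumption of this sketch).
    """
    digest = hashlib.blake2b(str(raw_id).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_slots


def collision_rate(raw_ids: Iterable[int], num_slots: int) -> float:
    """Fraction of distinct raw IDs whose slot is already owned by another ID."""
    owner: Dict[int, int] = {}  # slot -> first raw ID that claimed it
    distinct = set(raw_ids)     # repeated IDs are hits, not collisions
    collisions = 0
    for raw_id in sorted(distinct):
        slot = slot_for(raw_id, num_slots)
        if slot in owner:
            collisions += 1     # a different ID already occupies this slot
        else:
            owner[slot] = raw_id
    return collisions / len(distinct) if distinct else 0.0


if __name__ == "__main__":
    # 50,000 distinct IDs hashed into a table of 100,000 slots.
    print(collision_rate(range(50_000), num_slots=100_000))
```

In this sketch the rate rises as the number of distinct IDs approaches `num_slots`, which is the behavior a zero collision hash module is meant to avoid for as long as free slots remain.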

facebook-github-bot · 2025-06-16T21:04:10Z

This pull request was exported from Phabricator. Differential Revision: D76150895


facebook-github-bot · 2025-06-16T21:24:40Z

This pull request was exported from Phabricator. Differential Revision: D76150895


facebook-github-bot · 2025-06-16T21:44:25Z

This pull request was exported from Phabricator. Differential Revision: D76150895

facebook-github-bot added the CLA Signed label Jun 9, 2025

facebook-github-bot added the fb-exported label Jun 9, 2025

lizhouyu added 2 commits June 16, 2025 10:50

lizhouyu force-pushed the export-D76150895 branch from f06a2f1 to 53a0136 Compare June 16, 2025 19:05

lizhouyu force-pushed the export-D76150895 branch from 53a0136 to f815ed2 Compare June 16, 2025 20:56

lizhouyu force-pushed the export-D76150895 branch from f815ed2 to bfd8143 Compare June 16, 2025 21:04

lizhouyu force-pushed the export-D76150895 branch from bfd8143 to 8036d23 Compare June 16, 2025 21:24

lizhouyu force-pushed the export-D76150895 branch from 8036d23 to 2de3bb3 Compare June 16, 2025 21:44
