This repository provides the code to construct the HORIZON benchmark — a large-scale, cross-domain benchmark built by refactoring the popular Amazon-Reviews 2023 dataset for evaluating sequential recommendation and user behavior modeling.
We do not release any new data; instead, we share reproducible scripts and guidelines to regenerate the benchmark, enabling rigorous evaluation of generalization across time, unseen users, and long user histories. The benchmark supports modern research needs by focusing on temporal robustness, out-of-distribution generalization, and long-horizon user modeling beyond next-item prediction.
HORIZON is a benchmark for in-the-wild user modeling in the e-commerce domain. This repository provides the necessary code to load a publicly available dataset, process it to create a benchmark, and then run a diverse set of user modeling algorithms on the benchmark. The publicly available dataset was collected from amazon.com, likely representing users from the United States.
Our objective is to provide a standardized testbed for user modeling.
The HORIZON benchmark is intended for researchers, AI practitioners, and industry professionals interested in evaluating user modeling algorithms.
HORIZON can be used as a standardized evaluation platform to measure the performance of both existing and new algorithms. Our results may be most useful for settings involving products in categories similar to those in the dataset we used. For the list of these 33 categories, see https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/blob/main/all_categories.txt.
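If helpful, the category list can also be pulled programmatically. The sketch below is a minimal example using the `huggingface_hub` client; it assumes the package is installed and that `all_categories.txt` still exists at the top level of the dataset repository:

```python
from huggingface_hub import hf_hub_download

# Download the category list published alongside the Amazon-Reviews 2023 dataset.
# Assumption: `huggingface_hub` is installed and `all_categories.txt` is still
# present at the top level of the dataset repository.
path = hf_hub_download(
    repo_id="McAuley-Lab/Amazon-Reviews-2023",
    filename="all_categories.txt",
    repo_type="dataset",
)

with open(path) as f:
    categories = [line.strip() for line in f if line.strip()]

print(f"{len(categories)} categories, e.g. {categories[:5]}")
```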
HORIZON benchmark is not intended to be used to circumvent any policies adopted by LLM providers.
The user modeling algorithms provided in HORIZON are for e-commerce product scenarios only and may not translate to other kinds of products or buying behavior.
We have evaluated many state-of-the-art algorithms on the HORIZON benchmark. For details, please refer to the accompanying arXiv paper.
- HORIZON provides an offline evaluation. In real-world applications, offline evaluation results may differ from online evaluation, which involves deploying a user modeling algorithm.
- The HORIZON benchmark contains only items in the English language.
- The accuracy of HORIZON evaluation metrics for a real-world application depends on the diversity and representativeness of the underlying data.
This project is primarily designed for research and experimental purposes. We strongly recommend conducting further testing and validation before considering its application in industrial or real-world scenarios.
We welcome feedback and collaboration from our audience. If you have suggestions, questions, or would like to contribute to the project, please feel free to raise an issue or open a pull request.
The scripts for constructing the HORIZON benchmark are provided in the `data` folder. Follow these steps to reproduce the benchmark:
- Clone the repository:
  ```bash
  git clone <REPO_NAME>
  cd Horizon-Benchmark
  ```
- Create and activate the environment from the YAML file:
  ```bash
  conda env create -f environment.yaml
  conda activate <your-env-name>
  ```
- Give the necessary permissions to the bash scripts:
  ```bash
  chmod +x running_phase_one.sh
  chmod +x running_phase_two.sh
  chmod +x running_phase_three_metadata.sh
  ```
- Run the bash scripts to curate the dataset files:
  ```bash
  ./running_phase_one.sh
  ./running_phase_two.sh
  ./running_phase_three_metadata.sh
  ```

Summary of the Bash Scripts:
- The process is extremely RAM-intensive due to the massive size of the corpus being created and the multiprocessing/batching optimizations performed to make it efficient. If your system cannot support this, consider tweaking the relevant hyperparameters (e.g., the multiprocessing/batching settings) in the scripts.
- Phase One retrieves the category-wise data from the Amazon Reviews 2023 open-source repository and stores it in category-wise Parquet files.
- Phase Two merges these category-wise data into a final `merged_user_all.json` file, which contains the merged / category-agnostic user history of all users in the benchmark.
- Phase Three curates the metadata from Amazon Reviews and stores it in `JSON`, `Parquet` and `DB` files. Necessary post-processing, such as filtering out missing users/events and removing review texts, is done to produce lighter versions of the benchmark.
- At the end of Phase Three, the following files will have been created (a short loading sketch follows this list):
  - `amazon_parquet_data/metadata_titles.db`: SQLite database containing 3 columns, i.e. (1) product ASIN, (2) product title, and (3) product category, for all items in the catalog.
  - `amazon_parquet_data/merged_users_all_final_filtered.json`: cleaned final full-data JSON file with all users (users with zero-length histories or missing titles in the metadata are removed). The structure of this final JSON is as follows:

    ```
    {
      "{user_id}": {
        "history":    [I_1, I_2, ..., I_T],
        "timestamps": [t_1, t_2, ..., t_T],
        "ratings":    [r_1, r_2, ..., r_T],
        "reviews":    [review_1, review_2, ..., review_T]
      }
    }
    ```

  - `amazon_parquet_data/merged_users_all_final_filtered_no_reviews.json`: the same cleaned final full-data JSON but without the `reviews` field. We provide this lighter version of the full data for those who do not plan to use the `reviews` field in their study.
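As a quick sanity check of the Phase Three outputs, the sketch below loads the lighter user-history JSON and inspects the SQLite metadata database. It assumes the field names shown above (`history`, `timestamps`, `ratings`) and makes no assumption about the table or column names inside `metadata_titles.db`, which it discovers from the file itself:

```python
import json
import sqlite3

# Paths produced by Phase Three (see the list above).
USERS_JSON = "amazon_parquet_data/merged_users_all_final_filtered_no_reviews.json"
TITLES_DB = "amazon_parquet_data/metadata_titles.db"

# Load the per-user histories; field names follow the structure shown above.
with open(USERS_JSON) as f:
    users = json.load(f)

user_id, record = next(iter(users.items()))
print(user_id, "->", len(record["history"]), "events, first items:", record["history"][:3])

# Inspect the SQLite metadata database. The table and column names are not
# specified in this README, so list whatever the file actually contains.
con = sqlite3.connect(TITLES_DB)
tables = [row[0] for row in con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print("tables:", tables)
for table in tables:
    print(table, "columns:", con.execute(f"PRAGMA table_info({table})").fetchall())
con.close()
```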
The scripts for generating the splits described in the paper are shared in the `splits` folder. Follow the steps below to generate the splits:
- Generate user IDs for the 4 splits (as described in the paper) and save them in txt files. You will need the `amazon_parquet_data/merged_users_all_final_filtered_no_reviews.json` file from the previous steps for this. A random seed of `42` is set for the sampling, and the temporal threshold is set at `2019` for validation and `2020` for test (a small illustration of this temporal bucketing follows these steps):
  ```bash
  python3 prepare_split_ids_full.py
  ```
  This generates 4 txt files corresponding to the in-distribution and out-of-distribution validation and test set users.
- Populate the JSON files with the complete data of each corresponding split, in the same format as the JSON shown above:
  ```bash
  python3 write_splits_to_jsons_full.py
  ```
  This generates 4 JSON files corresponding to the in-distribution and out-of-distribution validation and test set users.
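For intuition on how the temporal thresholds partition users, the sketch below buckets each user by the year of their most recent interaction. It is only an illustration, not the actual logic of `prepare_split_ids_full.py`, and it assumes the `timestamps` field stores numeric Unix epoch times in milliseconds (adjust the division if your regenerated files use seconds):

```python
import json
from datetime import datetime, timezone

# Illustration only: bucket users by the year of their most recent interaction,
# mirroring the validation (2019) and test (2020) thresholds described above.
# Assumption: `timestamps` holds numeric Unix epoch times in milliseconds.
VAL_YEAR, TEST_YEAR = 2019, 2020

with open("amazon_parquet_data/merged_users_all_final_filtered_no_reviews.json") as f:
    users = json.load(f)

def last_event_year(record):
    ts_ms = max(record["timestamps"])
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).year

buckets = {"before_val": 0, "val_window": 0, "test_window": 0}
for record in users.values():
    year = last_event_year(record)
    if year >= TEST_YEAR:
        buckets["test_window"] += 1
    elif year >= VAL_YEAR:
        buckets["val_window"] += 1
    else:
        buckets["before_val"] += 1

print(buckets)
```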