|
This repository provides the code to construct the `HORIZON` benchmark, a large-scale, cross-domain benchmark built by refactoring the popular **Amazon-Reviews 2023** dataset for evaluating sequential recommendation and user behavior modeling.
We do not release any new data; instead, we share reproducible scripts and guidelines to regenerate the benchmark, enabling rigorous evaluation of generalization across time, unseen users, and long user histories. The benchmark supports modern research needs by focusing on temporal robustness, out-of-distribution generalization, and long-horizon user modeling beyond next-item prediction.
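To make the evaluation setup concrete, the sketch below shows one simple way such a temporal split can be constructed; the column names (`user_id`, `timestamp`) and the cutoff date are illustrative assumptions and do not reflect this repository's actual schema or scripts.

```python
# Illustrative sketch only: a time-based train/test split of the kind used to
# probe temporal generalization. Column names are assumptions, not the repo schema.
import pandas as pd

def temporal_split(interactions: pd.DataFrame, cutoff: str):
    """Hold out every interaction at or after `cutoff` for evaluation."""
    interactions = interactions.sort_values("timestamp")
    cutoff_ts = pd.Timestamp(cutoff)
    train = interactions[interactions["timestamp"] < cutoff_ts]
    test = interactions[interactions["timestamp"] >= cutoff_ts]
    # Unseen-user subset: test users with no interactions in the training window.
    unseen = test[~test["user_id"].isin(train["user_id"].unique())]
    return train, test, unseen
```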

## Overview
|
HORIZON is a benchmark for in-the-wild user modeling in the e-commerce domain. This repository provides the code needed to load a publicly available dataset, process it into the benchmark, and run a diverse set of user modeling algorithms on it. The dataset was collected from amazon.com and likely represents users from the United States.
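As a rough illustration of the first step (loading the public dataset), here is a minimal sketch using the Hugging Face `datasets` library; the config name `raw_review_All_Beauty` is one example taken from the Amazon-Reviews 2023 dataset card, and this is not the loading code shipped in this repository.

```python
# Sketch only: load one category of the public Amazon-Reviews 2023 dataset.
# The config name is an example from the dataset card, not a HORIZON script.
from datasets import load_dataset

reviews = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_review_All_Beauty",   # one example out of the 33 category configs
    trust_remote_code=True,    # the dataset uses a custom loading script
)
print(reviews)  # inspect the available splits and review fields
```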

## Objective
|
Our objective is to provide a standardized testbed for user modeling.

## Audience
|
The HORIZON benchmark is intended for researchers, AI practitioners, and industry professionals interested in evaluating user modeling algorithms.

## Intended Uses
|
The HORIZON benchmark can be used as a standardized platform to evaluate the performance of both existing and new algorithms. Our results are likely most useful for settings involving products in categories similar to those in the dataset we used. For a list of these 33 categories, see https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/blob/main/all_categories.txt.
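For convenience, the category list referenced above can also be fetched programmatically; the snippet below is a small sketch using `huggingface_hub` and is not part of this repository's scripts.

```python
# Sketch only: fetch the category list referenced above from the dataset repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="McAuley-Lab/Amazon-Reviews-2023",
    filename="all_categories.txt",
    repo_type="dataset",
)
with open(path) as f:
    categories = [line.strip() for line in f if line.strip()]
print(len(categories), categories[:5])  # expect 33 categories
```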

## Out of Scope Uses
|
The HORIZON benchmark is not intended to be used to circumvent any policies adopted by LLM providers.

The user modeling algorithms provided in HORIZON are for e-commerce product scenarios only and may not translate to other kinds of products or buying behavior.

## Evaluation
|
We have evaluated many state-of-the-art algorithms on the HORIZON benchmark. For details, please refer to the accompanying [arXiv paper](TBD).

## Limitations
|
- HORIZON provides an offline evaluation. In real-world applications, offline results may differ from those of an online evaluation in which a user modeling algorithm is actually deployed.

- The HORIZON benchmark contains only English-language items.

- The accuracy of HORIZON evaluation metrics for a real-world application depends on the diversity and representativeness of the underlying data.

## Usage
|
This project is primarily designed for research and experimental purposes. We strongly recommend conducting further testing and validation before considering its application in industrial or real-world scenarios.

## Feedback and Collaboration
|
We welcome feedback and collaboration from our audience. If you have suggestions, questions, or would like to contribute to the project, please feel free to raise an [issue](https://github.com/microsoft/horizon-benchmark/issues) or open a [pull request](https://github.com/microsoft/horizon-benchmark/pulls).

---

## HORIZON Benchmark Construction
|
### a. Curating the Full Dataset:
The scripts for constructing the `HORIZON` benchmark are provided in the `data` folder. Follow these steps to reproduce the benchmark:
|