This project aims to build a video recommendation system using the KuaiRec dataset, which contains rich interaction data between users and short videos. The model's goal is to predict the watch ratio (play time divided by video duration) for a given user and video.
The dataset is large and detailed:
- 1.5 GB interaction matrix for training.
- 10,728 videos
- 7,176 users
Key observations:
- The `items.csv` file contains a wide variety of features for each video (e.g., tags, duration, views, likes).
- The `users.csv` file already includes one-hot encoded categorical features.
- Memory management is critical due to the dataset size.
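Because memory is the main constraint, per-video statistics such as the average watch ratio can be accumulated while streaming the interaction file row by row instead of loading it whole. The sketch below is one minimal way to do that with the standard library; the column names `video_id` and `watch_ratio` and the function name `mean_watch_ratio_per_video` are assumptions for illustration, not taken from this project's code.

```python
import csv
import io
from collections import defaultdict


def mean_watch_ratio_per_video(lines):
    """Stream CSV rows and return {video_id: mean watch_ratio}.

    `lines` is any iterable of CSV text lines (an open file works).
    The column names `video_id` and `watch_ratio` are assumed.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        vid = int(row["video_id"])
        totals[vid] += float(row["watch_ratio"])
        counts[vid] += 1
    # Only one running sum and one counter per video are kept in memory.
    return {vid: totals[vid] / counts[vid] for vid in totals}


# Tiny in-memory stand-in for the real interaction file.
sample = io.StringIO(
    "user_id,video_id,watch_ratio\n"
    "1,10,0.5\n"
    "2,10,1.5\n"
    "3,11,2.0\n"
)
print(mean_watch_ratio_per_video(sample))
```

With a real file, passing `open(path, newline="")` in place of the `StringIO` object keeps memory use proportional to the number of videos rather than the number of interactions.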
To prepare the data:
- Watch Ratio:
  - Removed outliers with watch ratios > 3.
  - Applied Min-Max normalization (to scale between 0 and 1).
- Other Numerical Features:
  - Applied outlier filtering.
  - Used log scaling or Min-Max normalization.
- Categorical and One-Hot Vectors:
  - Padded user one-hot features (some had `NaN` values).
  - Padded video tag sequences to length 31.
- User features:
  - Used the existing one-hot vectors.
  - Did not include follower/following counts, for simplicity.
- Video features included:
  - Duration
  - Author ID
  - Average watch ratio
  - Tags (IDs and strings)
- Video features excluded:
  - Views and likes (to avoid bias toward popularity).
Some additional features were left out intentionally for future experimentation with weighting or additional sub-networks.
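The preprocessing steps listed above can be sketched with NumPy as follows. This is an illustrative sketch, not the project's actual pipeline: the function names, the padding ID `0`, and the choice of `log1p` for log scaling are all assumptions. The threshold of 3 and the tag length of 31 come from the notes above.

```python
import numpy as np


def preprocess_watch_ratio(ratios, max_ratio=3.0):
    """Drop watch ratios above `max_ratio`, then Min-Max scale the rest to [0, 1].

    Returns the scaled values and the boolean mask of rows that were kept.
    Assumes at least one value survives the filter.
    """
    ratios = np.asarray(ratios, dtype=np.float32)
    keep = ratios <= max_ratio
    kept = ratios[keep]
    lo, hi = kept.min(), kept.max()
    scaled = (kept - lo) / (hi - lo) if hi > lo else np.zeros_like(kept)
    return scaled, keep


def log_scale(values):
    """log(1 + x): compresses heavy-tailed, non-negative features (assumed variant)."""
    return np.log1p(np.asarray(values, dtype=np.float32))


def pad_tags(tag_lists, max_len=31, pad_id=0):
    """Right-pad (or truncate) each tag-ID list to `max_len`.

    `pad_id=0` is an assumption; real tag IDs would need to avoid that value.
    """
    out = np.full((len(tag_lists), max_len), pad_id, dtype=np.int64)
    for i, tags in enumerate(tag_lists):
        tags = list(tags)[:max_len]
        out[i, : len(tags)] = tags
    return out


scaled, keep = preprocess_watch_ratio([0.0, 0.75, 1.5, 3.0, 7.2])
print(scaled)                          # four values in [0, 1]; 7.2 was dropped
print(pad_tags([[5, 12], [3]]).shape)  # (2, 31)
```

Returning the `keep` mask alongside the scaled values makes it possible to drop the same rows from the feature arrays so that targets and inputs stay aligned.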
| Model | Result |
|---|---|
| ALS (Spark) | Didn't scale well; the dataset was too large. |
| Linear Regression | Poor performance (baseline). |
| Two-Tower Model | Best results so far. |
- Architecture: separate neural nets for users and videos.
- Output: predicted normalized watch ratio.
- Results:
  - Loss: `0.02`
  - MAE: `0.12`
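To make the two-tower idea concrete, here is a forward-pass-only sketch in NumPy. It is not the trained model from this project, which would normally be built in a deep learning framework. The layer sizes, the dot-product combination of the two embeddings, and the sigmoid that maps the score into (0, 1) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Random weights for a small MLP; `sizes` lists the layer widths."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp(x, layers):
    """ReLU between layers, linear final layer."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x


def predict(user_feats, video_feats, user_tower, video_tower):
    """Embed each side separately, take the per-pair dot product, squash to (0, 1)."""
    u = mlp(user_feats, user_tower)
    v = mlp(video_feats, video_tower)
    score = np.sum(u * v, axis=1)
    return 1.0 / (1.0 + np.exp(-score))


def mae(pred, target):
    """Mean absolute error between predictions and targets."""
    return float(np.mean(np.abs(pred - target)))


# Hypothetical widths: 20 user features, 12 video features, 16-d embeddings.
user_tower = init_mlp([20, 32, 16])
video_tower = init_mlp([12, 32, 16])
users = rng.random((4, 20))
videos = rng.random((4, 12))
pred = predict(users, videos, user_tower, video_tower)
print(pred.shape, mae(pred, np.full(4, 0.5)))
```

Keeping the towers separate means video embeddings can be computed once and reused for every user, which is the usual reason for choosing this architecture.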
Planned steps to enhance the recommender:
- Get ALS / matrix factorization working properly.
- Extract embeddings from:
  - Two-tower model
  - Matrix factorization
- Combine them into a hybrid model (e.g., late fusion or embedding concatenation).
- Explore a transformer-based model that:
  - Encodes video sequences with positional encoding.
  - Uses attention mechanisms for sequential recommendation.
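As a rough idea of what the transformer step might involve, the sketch below adds sinusoidal positional encodings to a sequence of video embeddings and runs one pass of scaled dot-product self-attention. None of this exists in the project yet. The sinusoidal variant, the single head without learned projections, and the embedding sizes are assumptions, and a real sequential recommender would usually also apply a causal mask.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model); d_model must be even."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe


def self_attention(x):
    """Single-head scaled dot-product self-attention with no learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


rng = np.random.default_rng(0)
video_embs = rng.normal(size=(5, 8))  # 5 watched videos, 8-d embeddings (made-up sizes)
encoded, weights = self_attention(video_embs + sinusoidal_positions(5, 8))
print(encoded.shape, weights.sum(axis=-1))
```

The video embeddings fed into such a model could come from the two-tower or matrix factorization models mentioned in the plan.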
Challenges encountered:
- Very large datasets (memory and performance issues).
- High number of outliers in interaction values.
- Limited compute power.
- Beginner mistakes (e.g., forgetting feature normalization at first).
- Feature overload: selecting the relevant ones took time.
Despite the challenges, this project was a great opportunity to learn about recommender systems, machine learning pipelines, and handling real-world-scale data.
If you have any questions or feedback, feel free to open an issue or contact me on GitHub!