Conversation

@1985312383 (Collaborator)
Pull Request

What does this PR do?


This PR implements the HSTU (Hierarchical Sequential Transduction Units) generative recommendation model proposed by Meta. Key features include:

  1. Complete HSTU model implementation

    • Core HSTULayer: Multi-head attention + gating + FFN
    • Modular HSTUBlock: Stacked HSTU transduction units
    • Full HSTUModel: Time-aware autoregressive generative recommender
  2. Time-aware data preprocessing pipeline

    • MovieLens-1M preprocessor with sliding window strategy
    • Time-difference feature extraction and bucketing
    • Cold-start filtering and data augmentation
  3. Training and evaluation framework

    • Generic sequence generation trainer SeqTrainer
    • Support for HR@K, NDCG@K ranking metrics
    • Complete end-to-end example scripts
  4. Utilities and documentation

    • Relative position bias RelPosBias
    • Vocabulary mapping and masking tools (VocabMapper, VocabMask)
    • Comprehensive Chinese documentation and reproduction notes

Type of Change

  • 🐛 Bug fix
  • ✨ New model/feature
  • 📝 Documentation
  • 🔧 Maintenance

Related Issues

N/A - this is a new feature implementation

Key Implementation Details

1. Model Architecture

  • HSTULayer (torch_rechub/basic/layers.py)

    • A single HSTU transduction unit combining multi-head self-attention, a gating mechanism, and a feed-forward network
    • Supports causal masking and relative position bias
    • Implemented entirely with standard PyTorch operators
  • HSTUModel (torch_rechub/models/generative/hstu.py)

    • Complete autoregressive generative recommendation model
    • Supports token, position, and time embeddings
    • Configurable model scale (d_model, n_heads, n_layers)
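For readers new to the architecture, the layer structure described above (self-attention with a causal mask, an elementwise gate, and an FFN) can be sketched in a few lines of plain PyTorch. This is an illustrative toy, not the PR's HSTULayer; class and parameter names here are made up for the example.

```python
# A minimal, illustrative HSTU-style transduction layer: multi-head
# self-attention + pointwise gating + FFN, under a causal mask.
# This sketch uses hypothetical names, not the PR's actual API.
import torch
import torch.nn as nn

class ToyHSTULayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)   # pointwise gating branch
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: position i may only attend to positions <= i
        # (True marks a disallowed attention edge).
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        # Gating: the attention output is modulated elementwise by a learned gate.
        x = x + torch.sigmoid(self.gate(h)) * attn_out
        return x + self.ffn(self.norm2(x))

x = torch.randn(2, 10, 64)               # (batch, seq_len, d_model)
layer = ToyHSTULayer(d_model=64, n_heads=8)
print(layer(x).shape)                     # output keeps the input shape
```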

2. Data Processing

  • Sliding window strategy (examples/generative/data/ml-1m/preprocess_ml_hstu.py)
    • Generates multiple training samples per user sequence, improving data utilization
    • Computes time-difference features between consecutive interactions
    • Supports sequence-length control and cold-start filtering
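The sliding-window and time-difference steps above can be sketched as follows. This is a simplified illustration with assumed bucket granularity and window parameters; the actual preprocess_ml_hstu.py may differ in details.

```python
# Hypothetical sketch of sliding-window sample generation with
# log-bucketed time-difference features. Parameter names and the
# bucketing scheme are assumptions for illustration only.
import math

def sliding_window_samples(items, timestamps, max_len=5, min_len=2):
    """Yield (input_items, time_gap_buckets, target_item) samples."""
    samples = []
    for end in range(min_len, len(items)):        # predict items[end]
        start = max(0, end - max_len)
        window = items[start:end]
        # Seconds between consecutive interactions, log2-bucketed so
        # similar gaps share one embedding id.
        gaps = [0] + [timestamps[i] - timestamps[i - 1]
                      for i in range(start + 1, end)]
        buckets = [int(math.log2(g + 1)) for g in gaps]
        samples.append((window, buckets, items[end]))
    return samples

items = [10, 20, 30, 40]
ts = [0, 60, 3600, 86400]                          # unix-style timestamps
for inp, buckets, target in sliding_window_samples(items, ts):
    print(inp, buckets, "->", target)
```

Each user sequence of length n thus yields up to n - min_len training samples, which is where the data-efficiency gain comes from.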

3. Training Framework

  • SeqTrainer (torch_rechub/trainers/seq_trainer.py)
    • Generic sequence generation trainer supporting HSTU, GRU4Rec, SASRec, and similar models
    • Uses CrossEntropyLoss for next-item prediction
    • Supports early stopping, model checkpointing, and evaluation metric computation
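The next-item objective above amounts to a classification over the item vocabulary. The following is a self-contained sketch of one training step under that objective, not SeqTrainer's internal code; the mean-pool encoder is a deliberate stand-in for the real model.

```python
# Sketch of the next-item prediction objective: score every vocabulary
# item and apply CrossEntropyLoss against the true next-item id.
# All names and shapes here are illustrative placeholders.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)     # full-softmax output layer
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()))

seq = torch.randint(0, vocab_size, (4, 10))      # (batch, seq_len) item ids
target = torch.randint(0, vocab_size, (4,))      # next item per sequence

# Mean-pool the sequence as a stand-in for the HSTU encoder.
hidden = embed(seq).mean(dim=1)                  # (batch, d_model)
logits = head(hidden)                            # (batch, vocab_size)
loss = loss_fn(logits, target)
loss.backward()
opt.step()
print(float(loss))
```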

4. Comparison with the Official Implementation

  • Similarities

    • Core architecture matches Meta's official HSTU
    • Supports time-aware modeling
    • Uses relative position bias
  • Differences

    • Pure PyTorch implementation, with no custom CUDA kernels
    • Explicit sliding window strategy, suited to medium-scale datasets
    • Full softmax instead of sampled softmax (extensible later)
    • Oriented toward research and teaching, prioritizing readability
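The relative position bias mentioned in both sections is typically a learned scalar per attention head per relative distance, added to the attention logits. A minimal sketch of that idea (not the PR's RelPosBias implementation; names and shapes are assumptions):

```python
# Illustrative learned relative position bias: one parameter per head
# per relative distance in [-(L-1), L-1], broadcast into an
# (n_heads, seq_len, seq_len) additive bias for attention logits.
import torch
import torch.nn as nn

class ToyRelPosBias(nn.Module):
    def __init__(self, max_seq_len: int, n_heads: int):
        super().__init__()
        # One bias per head for each possible relative distance.
        self.bias = nn.Parameter(torch.zeros(n_heads, 2 * max_seq_len - 1))
        self.max_seq_len = max_seq_len

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        # rel[i, j] = j - i, shifted so indices are non-negative.
        rel = pos[None, :] - pos[:, None] + self.max_seq_len - 1
        return self.bias[:, rel]          # (n_heads, seq_len, seq_len)

bias = ToyRelPosBias(max_seq_len=200, n_heads=8)
print(bias(10).shape)
```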

How to Test

1. Data Preprocessing

cd examples/generative/data/ml-1m
python preprocess_ml_hstu.py

2. Train the Model

from torch_rechub.models.generative.hstu import HSTUModel
from torch_rechub.trainers.seq_trainer import SeqTrainer
import torch

# Create the model
model = HSTUModel(
    vocab_size=3706,  # MovieLens-1M item count
    d_model=256,
    n_heads=8,
    n_layers=2,
    max_seq_len=200
)

# Create the trainer
trainer = SeqTrainer(
    model=model,
    optimizer_fn=torch.optim.Adam,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Train
trainer.fit(train_loader, val_loader, n_epoch=10)

# Evaluate
metrics = trainer.evaluate(test_loader)
print(f"HR@10: {metrics['hr@10']:.4f}, NDCG@10: {metrics['ndcg@10']:.4f}")
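The HR@10 and NDCG@10 values printed above can be computed per test sequence as follows. This is a self-contained sketch of the standard metric definitions, not the trainer's internal code:

```python
# Standard per-sample HR@K and NDCG@K for next-item prediction:
# HR@K is 1 if the true item appears in the top-K ranked list;
# NDCG@K discounts the hit by its rank position.
import math

def hr_ndcg_at_k(ranked_items, target, k=10):
    """ranked_items: item ids sorted by predicted score, best first."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0, 0.0
    rank = top_k.index(target)           # 0-based position of the true item
    return 1.0, 1.0 / math.log2(rank + 2)

# True item ranked 3rd: a hit, with a rank-discounted NDCG contribution.
print(hr_ndcg_at_k([5, 9, 7, 1, 3], target=7, k=10))   # (1.0, 0.5)
```

Averaging these per-sample values over the test set gives the reported HR@K and NDCG@K.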

3. Run the Complete Example

cd examples/generative
python run_hstu_movielens.py

Checklist

  • Code follows project style (ran python config/format_code.py)
  • Added tests for new functionality
  • Updated documentation if needed
  • All tests pass locally

Additional Notes

File Structure

torch_rechub/
├── basic/
│   └── layers.py                    # new: HSTULayer, HSTUBlock
├── models/
│   └── generative/
│       ├── __init__.py              # new
│       └── hstu.py                  # new: HSTUModel
├── trainers/
│   └── seq_trainer.py               # new: generic sequence trainer
├── utils/
│   ├── data.py                      # new: SeqDataset, SequenceDataGenerator
│   └── hstu_utils.py                # new: RelPosBias, VocabMask, VocabMapper
└── __init__.py                      # updated imports

examples/generative/
├── data/ml-1m/
│   └── preprocess_ml_hstu.py        # new: MovieLens-1M preprocessing
└── run_hstu_movielens.py            # new: end-to-end example

docs/zh/blog/
├── hstu_reproduction.md             # new: HSTU reproduction notes
└── hstu_implementation_details.md   # new: implementation details doc

Performance Metrics

Preliminary results on the MovieLens-1M dataset:

  • HR@10: ~0.15-0.20 (depending on hyperparameter configuration)
  • NDCG@10: ~0.08-0.12
  • Training time: ~10-15 minutes/epoch on a single GPU (RTX 3090)

Note: these are preliminary numbers and can be improved by tuning hyperparameters (e.g. more layers, larger dimensions, more training epochs).

Future Improvements

  1. Performance optimization

    • Sampled softmax to support large vocabularies
    • Optional FlashAttention integration
    • Distributed training support
  2. Feature extensions

    • Support for more datasets (Amazon Books, ML-20M, etc.)
    • Online inference interface
    • Model compression and quantization
  3. Documentation

    • English translation of the docs
    • More usage examples
    • Hyperparameter tuning guide
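Of the directions above, sampled softmax is the most mechanical to add: instead of scoring every vocabulary item, the positive item is scored against a small set of randomly drawn negatives. A rough sketch of that idea (not implemented in this PR; all names and shapes are assumptions):

```python
# Rough sketch of sampled softmax for large vocabularies: compute
# cross-entropy over [positive item + n_neg sampled negatives] rather
# than the full vocabulary. Negatives are drawn uniformly here; real
# implementations often use frequency-based sampling and may also
# mask accidental collisions with the positive item.
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, item_emb, target, n_neg=100):
    """hidden: (batch, d); item_emb: (vocab, d); target: (batch,) item ids."""
    vocab_size = item_emb.size(0)
    neg = torch.randint(0, vocab_size, (hidden.size(0), n_neg))
    cand = torch.cat([target[:, None], neg], dim=1)      # positive in column 0
    logits = torch.einsum('bd,bkd->bk', hidden, item_emb[cand])
    labels = torch.zeros(hidden.size(0), dtype=torch.long)  # positive index
    return F.cross_entropy(logits, labels)

hidden = torch.randn(4, 32)
item_emb = torch.randn(1000, 32)
target = torch.randint(0, 1000, (4,))
print(float(sampled_softmax_loss(hidden, item_emb, target)))
```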

References

  • Zhai et al., "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (Meta, ICML 2024) — the original HSTU paper.

Thank you for reviewing! Feel free to discuss any questions or suggestions.

1985312383 and others added 10 commits November 14, 2025 16:23
Introduces the HSTU (Hierarchical Sequential Transduction Units) generative recommendation model, including core layers, model definition, utility functions, and trainer. Adds MovieLens-1M preprocessing for HSTU, example scripts, and sequence data utilities. Updates package imports to support new generative components.
Major improvements to the MovieLens-1M preprocessing and HSTU model pipeline: preprocessing now uses a sliding window strategy to generate multiple training samples per user, includes time-difference features for time-aware modeling, and applies cold-start filtering. The HSTU model and layers now support time embeddings and causal masking. The training and evaluation scripts are updated to handle time-difference inputs and provide ranking metrics. These changes improve data efficiency, model expressiveness, and evaluation rigor.
Improved MovieLens-1M preprocessing for HSTU by switching to a sliding window strategy, adding time-difference features, and updating documentation and comments for clarity. Unified data format to include time-aware features, updated training and evaluation scripts to use time-aware positional encoding, and enhanced docstrings for HSTU layers and blocks. These changes align the implementation more closely with Meta's official HSTU logic and improve ranking metrics by better modeling temporal information.
Added 'HSTU Reproduction' to the blog section in mkdocs.yml for both English and Chinese navigation. Updated the Chinese blog post to summarize recent HSTU-related commits and removed outdated commit details.
Reformatted code across HSTU-related modules and data utilities to use more compact function calls and initialization patterns, reducing unnecessary line breaks and improving readability. No functional changes were made; this is a style and maintainability update.
Bumped yapf and isort versions in CI workflow to match pyproject.toml. Removed test_hstu_imports.py, which contained import and basic functionality tests for HSTU components.
@1985312383 1985312383 merged commit 222d87d into datawhalechina:main Nov 18, 2025
1 check passed
@codecov-commenter

Codecov Report

❌ Patch coverage is 7.89474% with 280 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.39%. Comparing base (8dbb87a) to head (c30a1ab).
⚠️ Report is 38 commits behind head on main.

Files with missing lines                      Patch %   Lines missing
torch_rechub/trainers/seq_trainer.py           11.49%   77
torch_rechub/models/generative/hstu.py          0.00%   64
torch_rechub/basic/layers.py                    9.52%   57
torch_rechub/utils/hstu_utils.py                0.00%   50
torch_rechub/utils/data.py                     18.91%   30
torch_rechub/models/generative/__init__.py      0.00%    2
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #122      +/-   ##
==========================================
- Coverage   39.20%   36.39%   -2.81%     
==========================================
  Files          47       52       +5     
  Lines        2844     3283     +439     
==========================================
+ Hits         1115     1195      +80     
- Misses       1729     2088     +359     
Flag        Coverage Δ
unittests   36.39% <7.89%> (-2.81%) ⬇️


@datawhalechina datawhalechina deleted a comment from LiuFan-libiao Nov 24, 2025