
Introducing HMAR: Efficient Hierarchical Masked AutoRegressive Image Generation, which builds on VAR #169

@Kumbong

Description

Thank you for the excellent work on VAR. I have really enjoyed building on this repository and wanted to share our new work, HMAR: Efficient Hierarchical Masked AutoRegressive Image Generation, which builds on VAR.

GitHub: https://github.com/NVlabs/HMAR
Checkpoints: https://huggingface.co/nvidia/HMAR
Paper: https://arxiv.org/abs/2506.04421
Blog: https://research.nvidia.com/labs/dir/hmar/

We make the following changes to VAR that improve efficiency, quality, and flexibility; short illustrative sketches of each idea follow the list.

  1. Markovian Assumption: We find that conditioning only on the immediately preceding scale, rather than on all previous scales, is enough to maintain good quality. This increases inference speed by up to 1.7x and reduces inference memory by up to 3x.
  2. Faster Training Kernels: FlashAttention does not support the types of attention masks used in VAR, so we write our own Triton attention kernels, speeding up training by up to 2.5x.
  3. Hierarchical Masking: We introduce masked refinement at each scale, so the tokens within a scale can be predicted over multiple steps instead of in a single step. This improves FID and Inception Score.
  4. Loss Reweighting: We find that the difficulty of learning each scale follows a log-normal pattern, and we use this to reweight the loss across scales, leading to improved performance.

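For (1), a minimal PyTorch sketch of the Markovian idea (toy shapes, not the HMAR implementation): the query tokens of scale k attend only to the tokens of scale k-1, so the KV context stays the size of a single scale instead of growing with every scale.

```python
import torch
import torch.nn.functional as F

B, D = 2, 64
side_lens = [1, 2, 3, 4]                       # toy token-map side lengths per scale
tokens = [torch.randn(B, s * s, D) for s in side_lens]

# Full-context, VAR-style attention: keys/values accumulate over all previous scales.
ctx = []
for x in tokens:
    ctx.append(x)
    kv = torch.cat(ctx, dim=1)                 # context grows with every scale
    out_full = F.scaled_dot_product_attention(x, kv, kv)

# Markovian, HMAR-style attention: condition only on the previous scale.
for k in range(1, len(tokens)):
    kv = tokens[k - 1]                         # context is just one scale
    out_markov = F.scaled_dot_product_attention(tokens[k], kv, kv)
```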
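For (2), a sketch of the mask pattern that motivates the custom kernels: VAR trains with a block-wise causal mask, where a token at scale k may attend to every token at scales up to and including k, which is not a pattern FlashAttention supports out of the box.

```python
import torch

def block_causal_mask(scale_sizes):
    """Boolean (L, L) mask: a token at scale k may attend to any token at scales <= k."""
    scale_id = torch.cat([torch.full((n,), i) for i, n in enumerate(scale_sizes)])
    return scale_id[:, None] >= scale_id[None, :]

# e.g. three scales with 1, 4, and 9 tokens -> a (14, 14) block-wise causal mask
mask = block_causal_mask([1, 4, 9])
```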
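For (3), a toy MaskGIT-style refinement schedule (illustrative only, not HMAR's actual schedule): instead of emitting all tokens of a scale in one pass, the scale is filled in over several masked-prediction steps.

```python
import torch

def refinement_steps(num_tokens, num_steps):
    """Yield the token indices predicted at each refinement step of one scale."""
    order = torch.randperm(num_tokens)         # toy: random reveal order
    done = 0
    for step in range(1, num_steps + 1):
        upto = round(num_tokens * step / num_steps)
        yield order[done:upto]                 # tokens unmasked at this step
        done = upto

for idx in refinement_steps(num_tokens=16, num_steps=4):
    print(idx.tolist())                        # 4 groups of 4 token indices
```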
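For (4), an illustrative reweighting sketch: per-scale loss weights taken from a log-normal density over the scale index. The parameters below are made up for illustration; the actual fit is described in the paper.

```python
import math
import torch

def lognormal_weights(num_scales, mu=1.0, sigma=0.5):
    """Illustrative per-scale loss weights from a log-normal density over the scale index."""
    k = torch.arange(1, num_scales + 1, dtype=torch.float32)
    pdf = torch.exp(-(torch.log(k) - mu) ** 2 / (2 * sigma ** 2)) / (k * sigma * math.sqrt(2 * math.pi))
    return pdf / pdf.sum()

weights = lognormal_weights(10)                        # one weight per scale
# total_loss = (weights * per_scale_losses).sum()      # per_scale_losses: hypothetical tensor
```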
We are really excited about VAR and the lines of work that follow!
