
Introducing HMAR: Efficient Hierarchical Masked AutoRegressive Image Generation, which builds on VAR #169

@Kumbong

Description

Thank you for the excellent work on VAR. I have really enjoyed building on this repository and wanted to share our new work, HMAR: Efficient Hierarchical Masked AutoRegressive Image Generation, which builds on VAR.

GitHub: https://github.com/NVlabs/HMAR
Checkpoints: https://huggingface.co/nvidia/HMAR
Paper: https://arxiv.org/abs/2506.04421
Blog: https://research.nvidia.com/labs/dir/hmar/

We make the following changes to VAR that improve efficiency, quality, and flexibility; short illustrative sketches of each idea follow the list.

  1. Markovian Assumption: We find that conditioning only on the immediately preceding scale, rather than on all previous scales, is enough to maintain good quality. This increases inference speed by up to 1.7x and reduces inference memory by up to 3x.
  2. Faster Training Kernels: FlashAttention does not support the types of attention masks used in VAR, so we write our own Triton attention kernels, speeding up training by up to 2.5x.
  3. Hierarchical Masking: We introduce masked refinement at each scale, so the tokens within a scale can be predicted over multiple steps instead of in a single step. This improves FID and Inception Score.
  4. Loss Reweighting: We find that the difficulty of learning each scale follows a log-normal pattern, and we use this to reweight the loss across scales, leading to improved performance.

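For (1), a minimal PyTorch sketch of the Markovian idea (toy shapes, not the HMAR implementation): the query tokens of scale k attend only to the tokens of scale k-1, so the KV context stays the size of a single scale instead of growing with every scale.

```python
import torch
import torch.nn.functional as F

B, D = 2, 64
side_lens = [1, 2, 3, 4]                       # toy token-map side lengths per scale
tokens = [torch.randn(B, s * s, D) for s in side_lens]

# Full-context, VAR-style attention: keys/values accumulate over all previous scales.
ctx = []
for x in tokens:
    ctx.append(x)
    kv = torch.cat(ctx, dim=1)                 # context grows with every scale
    out_full = F.scaled_dot_product_attention(x, kv, kv)

# Markovian, HMAR-style attention: condition only on the previous scale.
for k in range(1, len(tokens)):
    kv = tokens[k - 1]                         # context is just one scale
    out_markov = F.scaled_dot_product_attention(tokens[k], kv, kv)
```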
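For (2), a sketch of the mask pattern that motivates the custom kernels: VAR trains with a block-wise causal mask, where a token at scale k may attend to every token at scales up to and including k, which is not a pattern FlashAttention supports out of the box.

```python
import torch

def block_causal_mask(scale_sizes):
    """Boolean (L, L) mask: a token at scale k may attend to any token at scales <= k."""
    scale_id = torch.cat([torch.full((n,), i) for i, n in enumerate(scale_sizes)])
    return scale_id[:, None] >= scale_id[None, :]

# e.g. three scales with 1, 4, and 9 tokens -> a (14, 14) block-wise causal mask
mask = block_causal_mask([1, 4, 9])
```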
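For (3), a toy MaskGIT-style refinement schedule (illustrative only, not HMAR's actual schedule): instead of emitting all tokens of a scale in one pass, the scale is filled in over several masked-prediction steps.

```python
import torch

def refinement_steps(num_tokens, num_steps):
    """Yield the token indices predicted at each refinement step of one scale."""
    order = torch.randperm(num_tokens)         # toy: random reveal order
    done = 0
    for step in range(1, num_steps + 1):
        upto = round(num_tokens * step / num_steps)
        yield order[done:upto]                 # tokens unmasked at this step
        done = upto

for idx in refinement_steps(num_tokens=16, num_steps=4):
    print(idx.tolist())                        # 4 groups of 4 token indices
```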
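For (4), an illustrative reweighting sketch: per-scale loss weights taken from a log-normal density over the scale index. The parameters below are made up for illustration; the actual fit is described in the paper.

```python
import math
import torch

def lognormal_weights(num_scales, mu=1.0, sigma=0.5):
    """Illustrative per-scale loss weights from a log-normal density over the scale index."""
    k = torch.arange(1, num_scales + 1, dtype=torch.float32)
    pdf = torch.exp(-(torch.log(k) - mu) ** 2 / (2 * sigma ** 2)) / (k * sigma * math.sqrt(2 * math.pi))
    return pdf / pdf.sum()

weights = lognormal_weights(10)                        # one weight per scale
# total_loss = (weights * per_scale_losses).sum()      # per_scale_losses: hypothetical tensor
```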
We are really excited about VAR and the lines of work that follow!
