<p><a href="https://arxiv.org/abs/2504.19854">[Paper on ArXiv]</a> <a href="https://github.com/declare-lab/nora">[Code on GitHub]</a> <a href="https://huggingface.co/collections/declare-lab/nora-6811ba3e820ef362d9eca281">[Hugging Face]</a></p>
<font color="#061E61"><b>Figure 1:</b> NORA, as depicted in this figure, has three major components: (i) an image encoder, (ii) a vision-language model (VLM), and (iii) the FAST+ action tokenizer. The image encoder encodes the current state of the environment. The VLM then predicts the next action needed to accomplish the input goal, given the current state. Finally, FAST+ decodes the VLM's output tokens into executable robot actions.</font>
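For readers who prefer code to diagrams, the sketch below illustrates the three-stage inference flow described in the caption: encode the observation, let the VLM predict discrete action tokens for the goal, and decode those tokens into robot actions with FAST+. This is a minimal illustration, not NORA's confirmed API; the checkpoint IDs, the `trust_remote_code` loading path, and the `decode` call on the FAST+ processor are assumptions based on typical Hugging Face usage.

```python
# Illustrative sketch of the NORA inference pipeline (image -> VLM -> FAST+ -> actions).
# NOTE: the model IDs and the exact processor/decoder calls are assumptions, not
# NORA's documented entry points; consult the repository for the supported API.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# (i) + (ii) Load the vision-language model and its processor (which handles image encoding).
vlm_id = "declare-lab/nora"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(vlm_id, trust_remote_code=True)
vlm = AutoModelForVision2Seq.from_pretrained(vlm_id, torch_dtype=torch.bfloat16).eval()

# (iii) Load the FAST+ action tokenizer that maps VLM output tokens back to robot actions.
fast_plus = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", trust_remote_code=True  # assumed tokenizer source
)

# Encode the current observation together with the language goal.
image = Image.open("observation.png")
goal = "pick up the red block"
inputs = processor(images=image, text=goal, return_tensors="pt")

# The VLM predicts discrete action tokens conditioned on the observation and goal.
with torch.no_grad():
    action_token_ids = vlm.generate(**inputs, max_new_tokens=64)

# FAST+ decodes the predicted tokens into a continuous action chunk for the robot.
actions = fast_plus.decode(action_token_ids.tolist())
print(actions)
```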