[ICRA 2026] LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision-and-Language Navigation in Continuous Environments
- Conda Environment Setup
- Evaluation Code
- R2R-CE
- RxR-CE
- Docker Environment Setup
- Real-world Deployment Code (Aloha & Go1)
The codebase has been tested with Python 3.9. Python 3.8 should also be compatible.
conda create -n lavira python=3.9
conda activate lavira
Habitat-Sim is compiled locally in this setup. Alternatively, a precompiled conda version is available.
Habitat-Sim
git clone https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
git checkout tags/v0.1.7
pip install -r requirements.txt
CMAKE_ARGS="-DCMAKE_POLICY_VERSION_MINIMUM=3.5" python setup.py install --headless
Habitat-Lab
Note: The `tensorflow==1.13.1` dependency must be removed from `habitat_baselines/rl/requirements.txt` prior to installation to avoid dependency conflicts.
git clone https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
git checkout tags/v0.1.7
cd habitat_baselines/rl
vi requirements.txt # Remove the line: tensorflow==1.13.1
cd ../../ # Return to habitat-lab root directory
pip install torch==2.1.0+cu121 torchvision==0.16.0 torchaudio==2.1.0+cu121 \
-f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
# Installs both habitat and habitat_baselines.
# If installation fails due to network issues, please retry.
python setup.py develop --all
GroundingDINO is used to construct semantic maps for robot action planning.
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
git checkout -q 57535c5a79791cb76e36fdb64975271354f10251
pip install -q -e . --no-build-isolation
pip install 'git+https://github.com/facebookresearch/segment-anything.git'
# Note: Upgrading setuptools and wheel may be required.
Phrase-to-Class Mapping Optimization
Following CA-Nav, we recommend replacing the default phrases2classes function in Grounded-SAM with a minimum edit distance-based implementation for more robust and stable prediction outputs.
pip install nltk
Locate the phrases2classes method at Line 235 of <YOUR_PATH>/GroundingDINO/groundingdino/util/inference.py. Comment out the original implementation and replace it with the following:
# @staticmethod
# def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
#     class_ids = []
#     for phrase in phrases:
#         try:
#             class_ids.append(classes.index(phrase))
#         except ValueError:
#             class_ids.append(None)
#     return np.array(class_ids)

from nltk.metrics import edit_distance  # place this import at the top of inference.py

@staticmethod
def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
    class_ids = []
    for phrase in phrases:
        if phrase in classes:
            # Exact match: same behaviour as the original implementation.
            class_ids.append(classes.index(phrase))
        else:
            # No exact match: fall back to the class with the minimum edit distance.
            distances = np.array([edit_distance(phrase, c) for c in classes])
            class_ids.append(np.argmin(distances))
    return np.array(class_ids)
Then install the remaining dependencies:
pip install setuptools==58.5.3 meson-python ninja
pip install -r requirements.txt --use-pep517 --no-build-isolation
Download the Matterport3D scene dataset and the VLN-CE episode dataset, and place them under the data/ directory as described below.
Matterport3D Scenes
After obtaining the official download script (download_mp.py) by submitting the Terms of Use agreement at the Matterport3D project page, run:
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
VLN-CE Episodes
Download R2R_VLNCE_v1-3_preprocessed.zip (~250 MB) using the following command:
gdown https://drive.google.com/uc?id=1fo8F4NKgZDH-bPSdVU3cONAkt5EW-tyr
Download the required GroundedSAM model weights here and place them under data/grounded_sam/. The expected directory structure for the data/ folder is as follows:
data
├── grounded_sam
│ ├── groundingdino_swint_ogc.pth
│ ├── GroundingDINO_SwinT_OGC.py
│ ├── repvit_sam.pt
│ └── sam_vit_h_4b8939.pth
├── datasets
│ └── R2R_VLNCE_v1-3_preprocessed
│ ├── embeddings.json.gz
│ ├── envdrop
│ │ ├── envdrop_gt.json.gz
│ │ └── envdrop.json.gz
│ ├── joint_train_envdrop
│ │ ├── joint_train_envdrop_gt.json.gz
│ │ └── joint_train_envdrop.json.gz
│ ├── test
│ │ ├── test.json
│ │ └── test.json.gz
│ ├── train
│ │ ├── train_gt.json.gz
│ │ └── train.json.gz
│ ├── val_seen
│ │ ├── val_seen_gt.json.gz
│ │ └── val_seen.json.gz
│ └── val_unseen
│ ├── val_unseen_gt.json
│ ├── val_unseen_gt.json.gz
│ ├── val_unseen.json
│ └── val_unseen.json.gz
└── scene_datasets
└── mp3d
├── 17DRP5sb8fy
│ ├── 17DRP5sb8fy.glb
│ ├── 17DRP5sb8fy.house
│ ├── 17DRP5sb8fy.navmesh
│ └── 17DRP5sb8fy_semantic.ply
├── 1LXtFkjw3qL
│ ├── 1LXtFkjw3qL.glb
│ ├── 1LXtFkjw3qL.house
│ ├── 1LXtFkjw3qL.navmesh
│ └── 1LXtFkjw3qL_semantic.ply
└── ...
Download from here to get val_unseen.json.gz and val_unseen_gt.json.gz and place them under data/datasets/R2R_VLNCE_v1-3_preprocessed/val_unseen.
Set the following environment variables before running evaluation:
| Variable | Description |
|---|---|
| `LA_API_KEY` | API key for the Language Action Model |
| `LA_BASE_URL` | Base URL for the Language Action Model endpoint |
| `LA_MODEL_NAME` | Model name for the Language Action Model |
| `VA_API_KEY` | API key for the Vision Action Model |
| `VA_BASE_URL` | Base URL for the Vision Action Model endpoint |
| `VA_MODEL_NAME` | Model name for the Vision Action Model |
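For example, these can be exported in the shell before launching evaluation. The values below are placeholders and assume an OpenAI-compatible endpoint; substitute your own keys, base URLs, and model names (the model names shown follow the recommendations below):

```bash
# Placeholder values for illustration only.
export LA_API_KEY="sk-..."
export LA_BASE_URL="https://api.openai.com/v1"
export LA_MODEL_NAME="gpt-4o"

export VA_API_KEY="sk-..."
export VA_BASE_URL="http://localhost:8000/v1"   # e.g. a locally served Qwen2.5-VL-32B-Instruct
export VA_MODEL_NAME="Qwen2.5-VL-32B-Instruct"
```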
Recommended Models: We recommend using gemini-2.5-pro or gpt-4o as the Language Action Model, and Qwen2.5-VL-32B-Instruct as the Vision Action Model.
Note: Some models (e.g., Qwen3-VL, Gemini) return bounding box coordinates in a relative format. If you wish to use a different Vision Action Model, the coordinate parsing logic may need to be adjusted accordingly.
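As an illustration only (not code from this repository), converting a relative box to absolute pixel coordinates might look like the sketch below. The helper name and the `scale` parameter are hypothetical; the exact convention (0-1 floats vs. 0-1000 integers, and the coordinate ordering) depends on the model and should be checked against its documentation.

```python
def rel_to_abs_xyxy(box, img_w, img_h, scale=1.0):
    """Convert a relative [x1, y1, x2, y2] box to absolute pixel coordinates.

    Hypothetical helper for illustration. Use scale=1.0 for models that emit
    0-1 floats, or scale=1000.0 for models that emit 0-1000 normalized integers.
    """
    x1, y1, x2, y2 = box
    return [
        x1 / scale * img_w,
        y1 / scale * img_h,
        x2 / scale * img_w,
        y2 / scale * img_h,
    ]

# Example: a 0-1000 normalized box on a 640x480 image.
print(rel_to_abs_xyxy([250, 100, 750, 900], 640, 480, scale=1000.0))
# -> [160.0, 48.0, 480.0, 432.0]
```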
Configure CUDA_VISIBLE_DEVICES and the number of parallel processes (nprocess) in eval_scripts/r2r.sh, then run:
bash eval_scripts/r2r.sh
On a server equipped with an NVIDIA RTX 4090 GPU using 20 parallel processes, evaluation on the OpenNav-100 episodes completes in approximately 30 minutes.
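For reference, the settings mentioned above inside eval_scripts/r2r.sh might look like the hypothetical sketch below; the actual variable names and structure of the script may differ, so check the script itself.

```bash
# Hypothetical illustration of the settings to adjust in eval_scripts/r2r.sh.
export CUDA_VISIBLE_DEVICES=0,1    # GPUs visible to the evaluation processes
nprocess=20                        # number of parallel evaluation processes
```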
Set the following in lavira_main.py at Line 463:
self.visualize = True
After finishing evaluation, run
python server.py
The website will run on http://0.0.0.0:9999 by default.
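If evaluation runs on a remote machine, the visualization page can be reached through a standard SSH tunnel. This is an optional convenience, not part of the codebase; replace user@server with your own host:

```bash
# Forward the remote visualization port to the local machine, then open
# http://localhost:9999 in a browser.
ssh -L 9999:localhost:9999 user@server
```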
If you find this work useful, please cite our paper:
@article{ding2025lavira,
  title={LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision Language Navigation in Continuous Environments},
  author={Ding, Hongyu and Xu, Ziming and Fang, Yudong and Wu, You and Chen, Zixuan and Shi, Jieqi and Huo, Jing and Zhang, Yifan and Gao, Yang},
  journal={arXiv preprint arXiv:2510.19655},
  year={2025}
}
Our code is adapted from CA-Nav. Thanks for their work!