Hi, thanks for the great work!
I tried to reproduce the VSI-Bench results reported in Table 1 of the paper using the released Spatial-TTT-nano checkpoint and the evaluation code in this repo. Following the instructions in README.md and evaluation/spatial/readme.md exactly, I consistently get 60.62% overall, which is 3.8pp below the reported 64.4%.
## Reproduction Setup

- Checkpoint: `THU-SI/Spatial-TTT-nano`
- Base model: `Qwen/Qwen3-VL-2B-Instruct`
- Eval command: `bash evaluation/spatial/scripts/eval_spatial_ttt_2b.sh /path/to/spatial-ttt-nano official 8`
- Full dataset: 5,130 samples, 128 frames, 352×480 resolution
- Hardware: 8× RTX 4500 Ada 24 GB
- Environment: Python 3.10, PyTorch 2.8.0+cu126, transformers 4.57.0
I ran the evaluation twice (once in the original repo, once in a clean git worktree) and got identical results both times, so the evaluation is deterministic and the gap is not run-to-run variance.
## Per-Category Comparison

| Category | Paper (Table 1) | Reproduced | Gap |
|---|---|---|---|
| Object Count | 70.8 | 65.5 | -5.3 |
| Absolute Distance | 47.8 | 43.8 | -4.0 |
| Object Size | 71.7 | 68.8 | -2.9 |
| Room Size | 65.9 | 59.5 | -6.4 |
| Relative Distance | 61.8 | 59.6 | -2.2 |
| Relative Direction | 73.0 | 73.2 | +0.2 |
| Route Planning | 47.4 | 43.8 | -3.6 |
| Approach Order | 77.0 | 70.7 | -6.3 |
| **Overall** | **64.4** | **60.62** | **-3.8** |
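For reference, both overall numbers are consistent with the unweighted mean of the eight category scores (a minimal sanity check; I'm assuming VSI-Bench computes Overall this way, and that 60.61 vs. the logged 60.62 is just rounding of the per-category values):

```python
# Recompute the overall scores from the per-category numbers in the table above,
# assuming "Overall" is the unweighted mean of the eight categories.
paper = {
    "Object Count": 70.8, "Absolute Distance": 47.8, "Object Size": 71.7,
    "Room Size": 65.9, "Relative Distance": 61.8, "Relative Direction": 73.0,
    "Route Planning": 47.4, "Approach Order": 77.0,
}
reproduced = {
    "Object Count": 65.5, "Absolute Distance": 43.8, "Object Size": 68.8,
    "Room Size": 59.5, "Relative Distance": 59.6, "Relative Direction": 73.2,
    "Route Planning": 43.8, "Approach Order": 70.7,
}
overall_paper = sum(paper.values()) / len(paper)            # 64.425 -> 64.4
overall_repro = sum(reproduced.values()) / len(reproduced)  # 60.6125 -> 60.61
print(f"paper: {overall_paper:.2f}, reproduced: {overall_repro:.2f}")
```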
## Questions

- The README mentions that Spatial-TTT-nano is "trained on a mini spatial dataset with less than 1M samples," and the TODO lists "Update the full model (trained on all data)" as pending. Are the Table 1 results from a different (full) checkpoint that hasn't been released yet?
- If so, is there a timeline for releasing the full checkpoint that reproduces the reported numbers?
- If the released nano checkpoint is expected to match Table 1, could you share any additional configuration or steps I might be missing?
Thanks in advance!