Quankai Gao¹, Iliyan Georgiev², Tuanfeng Y. Wang², Krishna Kumar Singh², Ulrich Neumann¹⁺, Jae Shin Yoon²⁺

¹USC, ²Adobe Research
In this project, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which enables high-quality and efficient generative modeling of complex 3D scenes.
```bash
git clone https://github.com/Zerg-Overmind/Can3Tok.git
cd Can3Tok
```
We provide a script that sets up the conda environment in one shot. Please run the following command to create the environment:

```bash
bash env_in_one_shot.sh
```

and then activate it:

```bash
conda activate can3tok
```
Please refer to the official repo to install `lang-sam`, which is needed for our semantics-aware filtering implemented in `groundedSAM.py`. Note that the PyTorch version compatible with the latest `lang-sam` is `torch==2.4.1+cu121` instead of the `torch==2.1.0+cu121` in our `env_in_one_shot.sh`; please modify the environment file accordingly if you want to use the latest `lang-sam`.
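For reference, a minimal usage sketch of `lang-sam` text-prompted segmentation is shown below. This is an illustration only: the image path and text prompt are placeholders, the sketch assumes the older single-image `predict` signature (newer releases accept lists of images and prompts), and the exact call used in `groundedSAM.py` may differ.

```python
# Minimal lang-sam usage sketch (assumes the older single-image predict signature).
from PIL import Image
from lang_sam import LangSAM

model = LangSAM()
image = Image.open("frame_0001.png").convert("RGB")   # hypothetical input frame
masks, boxes, phrases, logits = model.predict(image, "the main object in the scene")
# The predicted masks can then be used to decide which 3D Gaussians project onto
# the object of interest and which should be filtered out.
```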
- We first run structure-from-motion (SfM) on the DL3DV-10K dataset with COLMAP to obtain camera parameters and sparse point clouds, i.e., SfM points.
- Then, there are two options for applying 3DGS optimization to the DL3DV-10K dataset with the camera parameters and SfM points initialized as above:
  - Option 1: We first normalize the camera parameters (centers/translation only) and SfM points into a unit sphere (or one of a predefined radius `target_radius` in the code), and then run 3DGS optimization. You might want to check `down_sam_init_sfm.py` for the details.
  - Option 2: Alternatively, we run 3DGS optimization first, and then normalize the camera parameters (centers/translation only) and the optimized 3D Gaussians into a unit sphere (or one of a predefined radius `target_radius` in the code) as a post-processing step, by normalizing their positions and anisotropic scaling factors. Please refer to `sfm_camera_norm.py` for the implementation of the normalization (a minimal sketch is also given after this list). Additionally, please refer to our `train.py` and related scripts for 3DGS optimization, which ensure that the output filenames match the corresponding input scenes from the DL3DV-10K dataset.
- (Optional) We can run semantics-aware filtering with `lang-sam` to filter out the 3D Gaussians that are not relevant to the main objects of interest in the scene. The implementation is provided in `groundedSAM.py`, which includes built-in 3DGS normalization, so there is no need to perform the normalization from the previous step separately; that is, we can directly run `groundedSAM.py` after 3DGS optimization. The output of this step is a filtered 3D Gaussian splatting point cloud, saved in each scene's 3DGS output folder.
- Finally, we can run Can3Tok training and testing with the 3D Gaussians (with or without filtering) as input. Please refer to `gs_can3tok.py` for the implementation.
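As referenced above, here is a minimal sketch of the sphere normalization. It is an illustration under assumptions, not the exact code in `down_sam_init_sfm.py` or `sfm_camera_norm.py`: whether the translation is computed from the camera centers or from the points, and whether Gaussian scales are handled in log space, depends on the actual scripts.

```python
import numpy as np

def normalize_to_sphere(cam_centers, positions, scales=None, target_radius=1.0):
    """Translate and scale cameras/points into a sphere of radius `target_radius`.

    cam_centers: (C, 3) camera centers; positions: (N, 3) SfM points or Gaussian means;
    scales: optional (N, 3) anisotropic Gaussian scales (assumed in linear, not log, space).
    """
    center = cam_centers.mean(axis=0)            # assumption: center on the mean camera position
    cam_centers = cam_centers - center
    positions = positions - center
    # Global isotropic scale so that everything fits inside the target radius.
    radius = max(np.linalg.norm(cam_centers, axis=1).max(),
                 np.linalg.norm(positions, axis=1).max())
    s = target_radius / radius
    cam_centers, positions = cam_centers * s, positions * s
    if scales is not None:
        scales = scales * s                      # Gaussian extents shrink/grow by the same factor
    return cam_centers, positions, scales, center, s
```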
To enable uniform training of Can3Tok across thousands of diverse scenes, we enforce a consistent number of 3D Gaussians per scene. A naive approach would be to initialize the 3DGS representation of each scene using the same number of SfM points while disabling densification and pruning. However, this often leads to suboptimal results. Instead, our logic for densification and pruning is as follows:
- We start densification from iteration `opt.densify_from_iter` (e.g., 7000), as in the official 3DGS implementation.
- We perform densification until iteration `opt.densify_until_iter` (e.g., 15000).
- At iteration `opt.densify_until_iter`, we prune the set of Gaussians down to exactly `dataset.num_gs_per_scene_end**2` primitives, e.g., 200*200 = 40000.
- After that, we continue 3DGS optimization until `opt.iterations` (e.g., 30000).
This ensures that each scene ends training with a fixed number of Gaussians, at the cost of only a small PSNR degradation. Please refer to the code in `train.py` for details. We also provide a hint and code for starting from a fixed number of SfM points as initialization in `scene/dataset_readers.py`.
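For illustration, a minimal sketch of the pruning step at `opt.densify_until_iter` is shown below. It assumes a `GaussianModel`-style object with a `get_opacity` property and a `prune_points(mask)` method, as in the official 3DGS codebase; the pruning criterion used here (drop the lowest-opacity Gaussians first) is an assumption and may differ from the one in `train.py`.

```python
import torch

def prune_to_fixed_count(gaussians, target_count):
    """Keep exactly `target_count` Gaussians, dropping the lowest-opacity ones."""
    opacities = gaussians.get_opacity.squeeze(-1)        # shape: (N,)
    num_gaussians = opacities.shape[0]
    if num_gaussians <= target_count:
        return  # nothing to prune; the scene already has few enough Gaussians
    # Indices of the (N - target_count) least-opaque Gaussians.
    _, drop_idx = torch.topk(opacities, num_gaussians - target_count, largest=False)
    prune_mask = torch.zeros(num_gaussians, dtype=torch.bool, device=opacities.device)
    prune_mask[drop_idx] = True
    gaussians.prune_points(prune_mask)                    # remove the marked primitives

# Inside the training loop, e.g.:
# if iteration == opt.densify_until_iter:
#     prune_to_fixed_count(gaussians, dataset.num_gs_per_scene_end ** 2)
```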
If you already have 3DGS results for the DL3DV-10K dataset, you can skip the 3DGS optimization step and directly use `groundedSAM.py` to crop out a user-specified number of Gaussians for each scene (e.g., 40K or 100K) for training Can3Tok. You will also need to modify the output size of the decoder MLP to match the input number of 3D Gaussians.
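If you just want a quick way to reduce already-optimized scenes to a fixed Gaussian count, a rough sketch is below. It is a simplification, not what `groundedSAM.py` does: it simply keeps the most opaque Gaussians and assumes the standard 3DGS `.ply` layout with an `opacity` field; the file paths are placeholders.

```python
import numpy as np
from plyfile import PlyData, PlyElement

def crop_gaussians(in_ply, out_ply, num_keep=40_000):
    """Keep the `num_keep` most opaque Gaussians from a 3DGS point_cloud.ply."""
    ply = PlyData.read(in_ply)
    verts = ply["vertex"].data                        # structured array, one row per Gaussian
    if len(verts) > num_keep:
        order = np.argsort(verts["opacity"])[::-1]    # most opaque first
        verts = verts[order[:num_keep]]
    PlyData([PlyElement.describe(verts, "vertex")]).write(out_ply)

# crop_gaussians("scene_0001/point_cloud.ply", "scene_0001/point_cloud_40k.ply")
```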
To train Can3Tok, please run:

```bash
python gs_can3tok.py
```

You might want to modify the path to the input 3D Gaussians and the output path in the script. For evaluation, please uncomment the evaluation part in `gs_can3tok.py`.
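For orientation, a generic sketch of a VAE training step over per-scene Gaussian parameter tensors is shown below. The model interface, loss weighting, and tensor layout are placeholders; the actual Can3Tok architecture and objectives are defined in `gs_can3tok.py`.

```python
import torch
import torch.nn.functional as F

def vae_step(model, gs_params, optimizer, beta=1e-4):
    """One training step. gs_params: (B, N, C) per-Gaussian parameters
    (positions, scales, rotations, opacities, colors), as loaded from each scene."""
    recon, mu, logvar = model(gs_params)                  # encode -> latent -> decode
    recon_loss = F.mse_loss(recon, gs_params)             # reconstruct the Gaussian parameters
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to a standard normal
    loss = recon_loss + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```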
We also provide the code for training and evaluating the baselines in `gs_pointtransformer.py`, `gs_ae.py`, `gs_pointvae.py`, etc. Please also refer to the `tsne_exp*` scripts for t-SNE visualizations of the latent spaces of Can3Tok and the baselines.
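As a rough illustration of the kind of plot the `tsne_exp*` scripts produce, here is a minimal t-SNE visualization over per-scene latent codes; the `latents` array and plotting details are assumptions, not the scripts' actual code.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_tsne(latents, out_path="tsne.png", perplexity=30):
    """latents: (M, D) array of per-scene latent embeddings from the trained encoder."""
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(latents)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], s=4)
    plt.title("t-SNE of scene-level latent embeddings")
    plt.savefig(out_path, dpi=200)
```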
Feel free to explore generative applications built on Can3Tok, such as 3D scene synthesis with various diffusion models!
We would like to thank the authors of the following repositories for their open-source code and datasets, which we built upon in this work:
If you find our code or paper useful, please consider citing:
```bibtex
@INPROCEEDINGS{gao2025ICCV,
  author    = {Quankai Gao and Iliyan Georgiev and Tuanfeng Y. Wang and Krishna Kumar Singh and Ulrich Neumann and Jae Shin Yoon},
  title     = {Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
```