📝 Note: As an exception, I include one and only one image dataset, MegaSynth, because of its size (700K scenes) and the remarkable improvement in depth estimation achieved by a Depth Anything V2 ViT-B model fine-tuned on MegaSynth and evaluated on Hypersim. See the results in Table 6. A minimal inference sketch for Depth Anything V2 follows the table below.
| | Dataset | Venue | Resolution |
|---|---|---|---|
| 1 | MegaSynth | | 512×512 |
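As context for the note above, here is a minimal, hedged sketch of running Depth Anything V2 (ViT-B) for monocular depth inference via the Hugging Face `transformers` depth-estimation pipeline. The checkpoint id and input path are assumptions for illustration; this is the off-the-shelf model, not the MegaSynth fine-tune referenced in the note.

```python
# Minimal sketch: monocular depth inference with Depth Anything V2 ViT-B.
# Assumptions: the checkpoint id and the input image path are illustrative;
# this is the off-the-shelf model, not the MegaSynth fine-tune.
from PIL import Image
from transformers import pipeline

depth_pipe = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Base-hf",  # assumed checkpoint id
)

image = Image.open("example_frame.png")  # hypothetical input frame
result = depth_pipe(image)

# result["depth"] is a PIL image of the predicted (relative) depth map;
# result["predicted_depth"] is the raw tensor.
result["depth"].save("example_frame_depth.png")
```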
📝 Notes: 1) Do not use the SYNTHIA dataset for training HD video depth estimation models! Its depth maps do not match the corresponding RGB images; this is particularly evident around tree leaves. Example pair: SYNTHIA-SEQS-01-SPRING\Depth\Stereo_Left\Omni_F\000071.png and SYNTHIA-SEQS-01-SPRING\RGB\Stereo_Left\Omni_F\000071.png.
2) Do not use the SynDrone dataset for training HD video depth estimation models! Its depth maps contain large white areas of incorrect depth, which should not happen in a synthetic dataset. Example image: Town01_Opt_120_depth\Town01_Opt_120\ClearNoon\height20m\depth\00031.png. A simple screening sketch for such saturated depth regions follows the table below.
| | Dataset | Venue | Resolution | GC | MoG | C3R | DP | UD2 | VDA | D2U | POM | RD | BoT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Spring | | 1920×1080 | T | T | T | E | - | - | T | - | - | - |
| 2 | HorizonGS | | 1920×1080 | - | - | - | - | - | - | - | - | - | - |
| 3 | MVS-Synth | | 1920×1080 | T | T | T | T | - | - | - | - | - | - |
| 4 | SynDrone 🚫 Do not use! 🚫 | | 1920×1080 | - | - | - | - | - | - | - | - | - | - |
| 5 | Mid-Air | | 1024×1024 | T | T | - | - | - | - | - | - | - | - |
| 6 | MatrixCity | | 1000×1000 | T | T | - | - | T | - | - | - | - | - |
| 7 | SAIL-VOS 3D | | 1280×800 | - | - | - | T | - | - | - | - | - | - |
| 8 | SYNTHIA-Seqs 🚫 Do not use! 🚫 | | 1280×760 | T | T | - | - | - | - | - | - | - | - |
| 9 | BEDLAM | | 1280×720 | - | - | T | T | T | - | - | - | - | - |
| 10 | Dynamic Replica | | 1280×720 | T | - | T | T | T | - | - | T | - | - |
| 11 | BlinkVision | | 960×540 | - | - | - | - | - | - | T | - | - | - |
| 12 | PointOdyssey | | 960×540 | - | - | T | - | T | T | T | T | E | - |
| 13 | DyDToF | | 960×540 | - | - | - | - | - | - | - | - | E | - |
| 14 | IRS | | 960×540 | T | T | T | T | - | T | - | - | - | - |
| 15 | Scene Flow | | 960×540 | E | - | - | - | - | - | - | - | - | - |
| 16 | THUD++ | | 730×530 | - | - | - | - | - | - | - | - | - | - |
| 17 | 3D Ken Burns | | 512×512 | T | T | T | T | - | - | - | - | - | - |
| 18 | TartanAir | | 640×480 | T | T | T | T | T | T | T | T | T | - |
| 19 | ParallelDomain-4D | | 640×480 | - | - | - | - | - | - | - | T | - | - |
| 20 | GTA-SfM | | 640×480 | T | T | - | - | - | - | - | - | - | - |
| 21 | InteriorNet | | 640×480 | - | - | - | - | - | - | - | - | - | - |
| 22 | MPI Sintel | | 1024×436 | E | E | E | E | E | E | E | E | - | E |
| 23 | Virtual KITTI 2 | | 1242×375 | T | - | T | T | - | T | - | - | - | - |
| 24 | TartanAir Shibuya | | 640×360 | - | - | - | - | - | - | - | - | - | E |
| | Total: T (training) | | | 11 | 9 | 9 | 8 | 5 | 4 | 4 | 4 | 1 | 0 |
| | Total: E (testing) | | | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 2 |
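Related to notes 1) and 2) above, the following is a minimal, hedged sketch of how one might screen synthetic depth maps for large saturated ("white") regions before adding a dataset to a training mix. The directory root, file glob, and the 5% threshold are illustrative assumptions, not part of any dataset's official tooling.

```python
# Minimal sketch: flag depth PNGs dominated by saturated ("white") pixels,
# as seen in the SynDrone example above. Paths and threshold are assumptions.
from pathlib import Path

import numpy as np
from PIL import Image


def saturated_fraction(depth_path: Path) -> float:
    """Return the fraction of pixels at the maximum representable value."""
    depth = np.asarray(Image.open(depth_path))
    max_value = np.iinfo(depth.dtype).max  # PNG depth maps are integer-typed (8/16-bit)
    return float((depth >= max_value).mean())


def screen_depth_dir(root: str, threshold: float = 0.05) -> list[Path]:
    """List depth maps whose saturated area exceeds `threshold` (5% by default)."""
    suspicious = []
    for path in sorted(Path(root).rglob("*.png")):
        if saturated_fraction(path) > threshold:
            suspicious.append(path)
    return suspicious


if __name__ == "__main__":
    for path in screen_depth_dir("Town01_Opt_120_depth"):  # hypothetical root directory
        print("Suspicious depth map:", path)
```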
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel ≤ 0.079 (AbsRel is computed as in the sketch after this list)
- NYU-Depth V2: AbsRel ≤ 0.042 (relative depth)
- NYU-Depth V2: AbsRel ≤ 0.051 (metric depth)
- Appendix 1: Rules for qualifying models for the rankings (to do)
- Appendix 2: Metrics selection for the rankings (to do)
- Appendix 3: List of all research papers from the above rankings
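For reference, here is a minimal sketch of the absolute relative error (AbsRel) used in the qualification thresholds above, i.e. the mean of |predicted − ground truth| / ground truth over valid pixels. The median-scaling alignment step for relative-depth predictions is a commonly used convention and an assumption here, not necessarily the exact protocol of every paper.

```python
# Minimal sketch of AbsRel = mean(|pred - gt| / gt) over valid pixels.
# The median-scaling alignment for relative-depth models is an assumed,
# commonly used convention, not necessarily each paper's exact protocol.
import numpy as np


def abs_rel(pred: np.ndarray, gt: np.ndarray, align: bool = True) -> float:
    """Absolute relative error between predicted and ground-truth depth."""
    valid = gt > 0                      # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    if align:                           # scale-align relative-depth predictions
        pred = pred * (np.median(gt) / np.median(pred))
    return float(np.mean(np.abs(pred - gt) / gt))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 10.0, size=(480, 640))
    pred = gt * 1.05                    # a prediction that is uniformly 5% too deep
    print(f"AbsRel = {abs_rel(pred, gt, align=False):.3f}")  # -> 0.050
```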
| RK | Model <br> Links: Venue Repository | LPIPS ↓ {Input fr.} <br> Table 1 M2SVid |
|---|---|---|
| 1 | M2SVid | 0.180 {MF} |
| 2 | SVG | 0.217 {MF} |
| 3 | StereoCrafter | 0.242 {MF} |
📝 Note: 1) See Figure 4. 2) The ranking order is determined first by a direct comparison of the two models' scores in the same paper. If no paper provides such a direct comparison, or different papers disagree, the order is determined by the best score achieved by each of the two models across all papers shown as data sources in the columns. The DepthCrafter rank is based on the latest version, 1.0.1.
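For completeness, below is a minimal sketch of computing LPIPS (the lower-is-better metric in the ranking above) with the `lpips` PyPI package. The backbone choice ('alex'), the image paths, and the preprocessing are assumptions; the papers' exact evaluation protocol (crops, resolution, frame sampling) may differ.

```python
# Minimal sketch: LPIPS between a generated frame and its reference frame
# using the `lpips` package (pip install lpips). Backbone, paths, and
# preprocessing are assumptions; papers may use a different protocol.
import lpips
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.ToTensor(),                                # [0, 1], shape (3, H, W)
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # -> [-1, 1], as LPIPS expects
])

loss_fn = lpips.LPIPS(net="alex")  # 'alex' is a common default backbone

pred = to_tensor(Image.open("generated_frame.png").convert("RGB")).unsqueeze(0)
ref = to_tensor(Image.open("reference_frame.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    distance = loss_fn(pred, ref)  # lower = perceptually closer

print(f"LPIPS = {distance.item():.3f}")
```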