Comparing some recent stereo algorithms in the wild
Tons of progress has been made in estimating depth from a pair of stereo images. Current state-of-the-art methods all rely on deep learning and reach impressive accuracy on public benchmarks, even in hard textureless areas. But how do they actually perform in practice on stereo images captured in the wild? I am especially interested in indoor scenes (room scanning, object scanning), while the most frequently used large stereo benchmark is KITTI, which focuses on outdoor autonomous driving. So let's see how well these methods generalize and what the performance/memory tradeoffs are for some of them.
Included methods
- OpenCV StereoBM and StereoSGBM, using reasonable parameters. These "traditional" non-deep-learning methods are based on block matching and are meant to act as a baseline. Their memory footprint is tiny and they are very fast, but they struggle with textureless areas and tend to produce sparse, noisy output. The OpenCV 4.5.5 implementation is used.
- CREStereo (CVPR 2022): "Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation". Their objective is to provide a robust algorithm that can handle approximate calibrations and consumer-grade cameras. They also aim for memory efficiency by using a local search window instead of a full cost volume. They share a model pretrained on SceneFlow, Sintel, Middlebury, ETH3D, KITTI, Falling Things, InStereo2K and HR-VS. We use two variants converted to ONNX by PINTO0309:
  - CREStereo-Accurate: 5 iterations, a more balanced speed/accuracy tradeoff.
  - CREStereo-Fast: 2 iterations, aiming for speed.
- RAFT-Stereo (3DV 2021): “Multilevel Recurrent Field Transforms for Stereo Matching”. Inspired by the success of RAFT for optical flow estimation, the authors adapted it to stereo. It is an iterative method that keeps refining the depth with multi-scale gated recurrent units. It is implemented using only 2D convolutions and a lighter 3D cost volume, and they propose a real-time setting. Two pre-trained variants are included (refer to the paper for details):
  - RAFT-Accurate: high accuracy settings, trained on SceneFlow + Middlebury.
  - RAFT-Fast: real-time settings, trained on SceneFlow.
- Hitnet (CVPR 2021): "Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching". They also focus on performance, avoiding a full cost volume computation and adopting a coarse-to-fine propagation with small planar tiles. Here too, two pre-trained variants are considered (converted to ONNX by PINTO0309):
  - Hitnet-Accurate: high accuracy settings, trained on SceneFlow.
  - Hitnet-Fast: fast settings, trained on KITTI + Middlebury + ETH3D images.
- RealtimeStereo (ACCV 2020): "Attention-Aware Feature Aggregation for Real-time Stereo Matching on Edge Devices". This paper targets real-time performance on embedded hardware like the NVidia Jetson TX2. The disparity is first computed at a very coarse resolution and progressively upsampled. Its accuracy on in-the-wild datasets is not as good as the other methods', but I included it since it's very fast and has a low memory footprint. The model was pre-trained by the authors on SceneFlow + KITTI.
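To make the block-matching baseline concrete, here is a toy numpy sketch of the core idea behind StereoBM: for each pixel, compare a small block against candidate blocks along the same row of the right image over a disparity search range, using the sum of absolute differences (SAD). This is a naive illustration I wrote for this post, not OpenCV's heavily optimized implementation.

```python
import numpy as np

def block_match(left, right, max_disp=16, block=5):
    """Naive SAD block matching on rectified grayscale images.

    For each left pixel, slide a block along the same row of the right
    image and keep the disparity with the lowest sum of absolute
    differences. Real implementations add subpixel refinement,
    uniqueness checks and speckle filtering on top of this.
    """
    h, w = left.shape
    r = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.int32)
            costs = [
                np.abs(ref - right[y - r:y + r + 1,
                                   x - d - r:x - d + r + 1].astype(np.int32)).sum()
                for d in range(max_disp)
            ]
            disp[y, x] = int(np.argmin(costs))
    return disp
```

The failure mode on textureless areas is easy to see here: on a blank wall every candidate block looks the same, so the argmin is essentially arbitrary, which is why these methods return sparse and noisy depth there.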
Qualitative results
For a qualitative comparison I’ve captured images with my OAK-D Lite stereo camera, using the factory calibration and the rectification computed on the device itself. I configured it to output VGA images (640x480).
The visualization is done via stereodemo, a small utility I developed to compare stereo algorithms. It's easy to try these methods yourself on new data with `pip install stereodemo`.
Here is a video of a hands + desk scene:
Living room scene:
Bedroom scene:
I would argue that CREStereo and RAFT-Stereo consistently give the best results, with a small advantage to RAFT-Stereo when using the fast settings. Hitnet is pretty good too, but it would probably need to be trained on more indoor scenes to stop warping blank walls. The RealtimeStereo of Chang et al. does not generalize very well outside of the KITTI dataset where it was initially trained and evaluated.
Performance
Peak memory usage
Let's first look at the peak memory usage, ordered from lowest to highest. For OpenCV-BM, SGBM, CREStereo and Hitnet I estimate it roughly by looking at the peak memory usage during CPU inference with `/usr/bin/time -v` and subtracting the memory usage when inference is not called. For RAFT-Stereo and RealtimeStereo the peak is measured for GPU inference with `torch.cuda.max_memory_allocated()`.
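As an illustration of this subtract-a-baseline approach, here is a minimal in-process sketch using Python's `resource` module (Unix only) to read the peak resident set size. This is my own alternative to running `/usr/bin/time -v` externally; the 64 MB numpy buffer is just a stand-in for the inference workload.

```python
import resource

import numpy as np

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

baseline = peak_rss_mb()
# Stand-in workload: allocate and touch ~64 MB, as inference would.
buf = np.ones((64, 1024, 1024), dtype=np.uint8)
inference_mb = peak_rss_mb() - baseline
```

For GPU inference this approach doesn't help since the allocations live in device memory, hence the use of `torch.cuda.max_memory_allocated()` for the PyTorch models.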
**Peak Memory Usage (MB)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | 3 | 5 | 11 |
| OpenCV-SGBM | 6 | 6 | 13 |
| RealtimeStereo | 5 | 18 | 56 |
| RAFT-Stereo (fast) | 114 | 172 | 450 |
| CREStereo (all) | 126 | 458 | 1309 |
| RAFT-Stereo (accurate) | 179 | 530 | 1512 |
| Hitnet (fast) | 182 | 856 | 2516 |
| Hitnet (accurate) | 785 | 2179 | 6973 |
Overall the traditional methods have a low footprint (note that OpenCV default semi-global matching uses a memory-efficient variant). RealtimeStereo is also memory efficient by computing a cost volume only at a very coarse resolution and then only performing small corrections with a tiny disparity range search. RAFT-Stereo downsamples the input more aggressively in its fast settings, leading to a reasonable footprint. CREStereo has no memory usage difference between the fast and accurate settings, and Hitnet tends to be quite memory hungry.
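To see why avoiding a full cost volume matters so much, here is a back-of-the-envelope calculation. The 128-disparity search range and float32 storage are illustrative assumptions, not values measured from any of these models.

```python
# A dense cost volume stores one matching cost per pixel and per
# candidate disparity. At VGA resolution:
h, w = 480, 640
max_disp = 128
bytes_per_entry = 4  # float32

volume_mb = h * w * max_disp * bytes_per_entry / 2**20
print(f"{volume_mb:.0f} MB")  # 150 MB
```

And that is for a single-channel cost; volumes built over multi-channel feature maps multiply this further. Searching a small local window (CREStereo) or building the volume at a coarse resolution (RealtimeStereo) shrinks this term dramatically.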
Inference speed
First, let's look at the inference performance on a top-end gaming GPU (NVidia RTX 3090), ranked by speed at VGA resolution. For RAFT-Stereo and RealtimeStereo the authors' original PyTorch implementation was used. For CREStereo and Hitnet the ONNX models converted by PINTO0309 were used with `onnxruntime`.
Note that these results should be taken with a grain of salt as the ONNX conversion might not be optimal, and some methods have optimizations that were not included. In particular RAFT-Stereo has a faster correlation sampler (code available), and Hitnet claims to be up to 3x faster with custom CUDA operations (code not available).
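The timings below follow the usual recipe: a few warm-up runs, then an average over several iterations. This harness is my own sketch, not the exact code used for the tables.

```python
import time

def benchmark_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds.

    For GPU inference, fn must block until the result is ready
    (e.g. call torch.cuda.synchronize() or copy the output back to
    the CPU), otherwise you only measure the kernel launch.
    """
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs * 1000.0
```

For the ONNX models this would wrap something like `benchmark_ms(lambda: session.run(None, inputs))` on an `onnxruntime.InferenceSession`.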
**GPU Inference (RTX 3090)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | N/A | N/A | N/A |
| OpenCV-SGBM | N/A | N/A | N/A |
| RealtimeStereo | 7 ms | 8 ms | 15 ms |
| RAFT-Stereo (fast) | 30 ms | 35 ms | 60 ms |
| Hitnet (fast) | 14 ms | 40 ms | 80 ms |
| CREStereo (fast) | 21 ms | 56 ms | 175 ms |
| CREStereo (accurate) | 30 ms | 90 ms | 280 ms |
| Hitnet (accurate) | 34 ms | 100 ms | 280 ms |
| RAFT-Stereo (accurate) | 110 ms | 190 ms | 430 ms |
All the methods are pretty fast on a beefy GPU, especially in their fast settings. As expected CPU inference is much slower, even with 8 cores.
**CPU Inference (8 cores, i9-9900K @ 3.6 GHz)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | 2.5 ms | 9 ms | 20 ms |
| OpenCV-SGBM (1 core) | 12 ms | 70 ms | 230 ms |
| RealtimeStereo | 15 ms | 70 ms | 180 ms |
| RAFT-Stereo (fast) | 157 ms | 550 ms | 1800 ms |
| Hitnet (fast) | 160 ms | 720 ms | 2340 ms |
| CREStereo (fast) | 300 ms | 1440 ms | 5580 ms |
| CREStereo (accurate) | 500 ms | 2130 ms | 8900 ms |
| Hitnet (accurate) | 620 ms | 2240 ms | 6960 ms |
| RAFT-Stereo (accurate) | 1720 ms | 6800 ms | 22200 ms |
Finally here are the results with CPU inference and just 1 core.
**CPU Inference (1 core, i9-9900K @ 3.6 GHz)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | 4 ms | 22 ms | 70 ms |
| OpenCV-SGBM | 12 ms | 70 ms | 230 ms |
| RealtimeStereo | 27 ms | 130 ms | 420 ms |
| RAFT-Stereo (fast) | 680 ms | 2600 ms | 8300 ms |
| Hitnet (fast) | 320 ms | 1370 ms | 4180 ms |
| CREStereo (fast) | 980 ms | 4240 ms | 13450 ms |
| Hitnet (accurate) | 1.1 s | 4.1 s | 12.2 s |
| CREStereo (accurate) | 1.9 s | 7.8 s | 24 s |
| RAFT-Stereo (accurate) | 8.5 s | 32.5 s | 102.5 s |
Conclusion
The recent deep learning approaches are very impressive on hard scenes compared to traditional methods. The drawback is that they sometimes create large areas of good looking but inaccurate geometry (e.g. warped walls), while block matching methods return sparser and noisier depths, but won’t hallucinate wrong geometry.
RAFT-Stereo appears to be a solid choice in terms of the speed / memory / accuracy tradeoff with its fast settings, and it generalizes well. CREStereo and Hitnet are also competitive, with CREStereo often giving the nicest results in its accurate settings. Its model was also trained on more public datasets, which might help. These methods are also quite easy to tune for a different speed tradeoff, and adding some room/indoor datasets during training would likely improve their accuracy.
RealtimeStereo is very fast, but the KITTI-trained model does not generalize well to indoor scenes.
Want to try these yourself, on pre-captured images or directly from an OAK-D camera? Just `pip install stereodemo` (https://github.com/nburrus/stereodemo).