Comparing some recent stereo algorithms in the wild

Tons of progress has been made in estimating depth from a pair of stereo images. The current state-of-the-art methods all rely on deep learning to reach impressive accuracy levels on public benchmarks, even in hard textureless areas. But how do they actually perform in practice on stereo images captured in the wild? I am especially interested in indoor scenes (room scanning, object scanning), while the most frequently used large stereo benchmark is KITTI, which is focused on outdoor autonomous driving. So let’s see how well they generalize and what the performance/memory tradeoffs are for some of them.

Included methods

  • OpenCV StereoBM and StereoSGBM, using reasonable parameters. These “traditional” non-deep-learning methods are based on block matching and meant to act as a baseline. Their memory footprint is tiny and they are very fast, but they struggle with textureless areas and their output tends to be quite sparse and noisy. The implementation of OpenCV 4.5.5 is used.

  • CREStereo (CVPR 2022): “Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation”. Their objective is to provide a robust algorithm that can handle approximate calibrations and consumer-grade cameras. They also try to be memory efficient by using a local search window instead of a full cost volume. They share a model pretrained on SceneFlow, Sintel, Middlebury, ETH3D, KITTI, Falling Things, InStereo2K and HR-VS. We use two variants converted to ONNX by PINTO0309:
    • CREStereo-Accurate: 5 iterations, a more balanced speed/accuracy tradeoff.
    • CREStereo-Fast: 2 iterations, aiming for speed.
  • RAFT-Stereo (3DV 2021): “Multilevel Recurrent Field Transforms for Stereo Matching”. Inspired by the success of RAFT for optical flow estimation, the authors adapted it to stereo. It is an iterative method that keeps refining the depth with multi-scale gated recurrent units. It is implemented using only 2D convolutions and a lighter 3D cost volume, and they propose a real-time setting. Two pre-trained variants are included (refer to the paper for details):
    • RAFT-Accurate: high accuracy settings trained on sceneflow + middlebury.
    • RAFT-Fast: real-time settings, trained on sceneflow.
  • Hitnet (CVPR 2021): “Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching”. They also focus on performance, avoid a full cost volume computation and adopt a coarse-to-fine propagation with small planar tiles. Here also two pre-trained variants are considered (converted to ONNX by PINTO0309):
    • Hitnet-Accurate: high accuracy settings trained on sceneflow.
    • Hitnet-Fast: fast settings, trained on KITTI + Middlebury + ETH3D images.
  • RealtimeStereo (ACCV 2020): “Attention-Aware Feature Aggregation for Real-time Stereo Matching on Edge Devices”. This paper targets real-time performance on embedded hardware like the NVidia Jetson TX2. The disparity is first computed at a very coarse resolution and progressively upsampled. The accuracy on datasets in the wild is not as good as the other methods, but I included it since it’s very fast and has a low memory footprint. The model was pre-trained by the authors on Sceneflow + KITTI.

Qualitative results

For a qualitative comparison I’ve captured images with my OAK-D Lite stereo camera, using the factory calibration and the rectification computed on the device itself. I configured it to output VGA images (640x480).

The visualization is done via stereodemo, a small utility I developed to compare stereo algorithms. It’s easy to try these methods yourself on new data with a pip install stereodemo.

Here is a video of a hands + desk scene:

Living room scene:

Bedroom scene:

I would argue that CREStereo and RAFT-Stereo consistently give the best results, with a small advantage to RAFT-Stereo when using the fast settings. Hitnet is pretty good too, but it would probably need to be trained on more indoor scenes to stop warping blank walls. The RealtimeStereo of Chang et al. does not generalize very well outside of the KITTI dataset where it was initially trained and evaluated.

Performance

Peak memory usage

Let’s first look at the peak memory usage, ordered from lowest to highest. For OpenCV-BM, SGBM, CREStereo and Hitnet I estimate it roughly by looking at the peak memory usage of CPU inference with /usr/bin/time -v and subtracting the memory usage when the inference is not called. For RAFT-Stereo and RealtimeStereo the peak is measured for GPU inference with torch.cuda.max_memory_allocated().
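For reference, a rough in-process variant of the /usr/bin/time -v approach can be sketched with Python’s resource module (Linux semantics assumed below; the numbers in the table were obtained with the external tool, not this snippet):

```python
import resource

def peak_rss_mb() -> float:
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

baseline_mb = peak_rss_mb()
# ... load the model and run inference here ...
footprint_mb = peak_rss_mb() - baseline_mb
print(f"Approximate inference footprint: {footprint_mb:.0f} MB")
```

Like the external measurement, this only gives an estimate: the peak resident set size includes allocator overhead and anything else the process touched.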

Peak Memory Usage (MB)

| Method                 | 320x240 | 640x480 | 1280x720 |
|------------------------|---------|---------|----------|
| OpenCV-BM              | 3       | 5       | 11       |
| OpenCV-SGBM            | 6       | 6       | 13       |
| RealtimeStereo         | 5       | 18      | 56       |
| RAFT-Stereo (fast)     | 114     | 172     | 450      |
| CREStereo (all)        | 126     | 458     | 1309     |
| RAFT-Stereo (accurate) | 179     | 530     | 1512     |
| Hitnet (fast)          | 182     | 856     | 2516     |
| Hitnet (accurate)      | 785     | 2179    | 6973     |

Overall the traditional methods have a low footprint (note that OpenCV default semi-global matching uses a memory-efficient variant). RealtimeStereo is also memory efficient by computing a cost volume only at a very coarse resolution and then only performing small corrections with a tiny disparity range search. RAFT-Stereo downsamples the input more aggressively in its fast settings, leading to a reasonable footprint. CREStereo has no memory usage difference between the fast and accurate settings, and Hitnet tends to be quite memory hungry.

Inference speed

First let’s look at the inference performance on a top-end gaming GPU (NVidia RTX 3090), ranked by speed at VGA resolution. For RAFT-Stereo and RealtimeStereo the original PyTorch implementation from the authors was used. For CREStereo and Hitnet the ONNX models converted by PINTO0309 were used with onnxruntime.

Note that these results should be taken with a grain of salt as the ONNX conversion might not be optimal, and some methods have optimizations that were not included. In particular RAFT-Stereo has a faster correlation sampler (code available), and Hitnet claims to be up to 3x faster with custom CUDA operations (code not available).
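The timings below were collected along these lines: warm up the model first, then average the wall-clock time of several runs. This is a simplified sketch rather than the exact harness; for GPU inference a synchronization call (e.g. torch.cuda.synchronize()) is also needed before reading the clock.

```python
import time

def benchmark_ms(run_inference, warmup=3, iters=20):
    """Average wall-clock time of run_inference() in milliseconds."""
    for _ in range(warmup):
        run_inference()  # warm-up: caches, lazy initialization, kernel compilation
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    return (time.perf_counter() - start) / iters * 1000.0

# Usage with a trivial stand-in workload:
elapsed_ms = benchmark_ms(lambda: sum(range(100_000)))
```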

GPU Inference (RTX 3090)

| Method                 | 320x240 | 640x480 | 1280x720 |
|------------------------|---------|---------|----------|
| OpenCV-BM              | N/A     | N/A     | N/A      |
| OpenCV-SGBM            | N/A     | N/A     | N/A      |
| RealtimeStereo         | 7 ms    | 8 ms    | 15 ms    |
| RAFT-Stereo (fast)     | 30 ms   | 35 ms   | 60 ms    |
| Hitnet (fast)          | 14 ms   | 40 ms   | 80 ms    |
| CREStereo (fast)       | 21 ms   | 56 ms   | 175 ms   |
| CREStereo (accurate)   | 30 ms   | 90 ms   | 280 ms   |
| Hitnet (accurate)      | 34 ms   | 100 ms  | 280 ms   |
| RAFT-Stereo (accurate) | 110 ms  | 190 ms  | 430 ms   |

All the methods are pretty fast on a beefy GPU, especially in their fast settings. As expected CPU inference is much slower, even with 8 cores.

CPU Inference (8 cores, i9-9900K @ 3.6 GHz)

| Method                 | 320x240 | 640x480 | 1280x720 |
|------------------------|---------|---------|----------|
| OpenCV-BM              | 2.5 ms  | 9 ms    | 20 ms    |
| OpenCV-SGBM (1 core)   | 12 ms   | 70 ms   | 230 ms   |
| RealtimeStereo         | 15 ms   | 70 ms   | 180 ms   |
| RAFT-Stereo (fast)     | 157 ms  | 550 ms  | 1800 ms  |
| Hitnet (fast)          | 160 ms  | 720 ms  | 2340 ms  |
| CREStereo (fast)       | 300 ms  | 1440 ms | 5580 ms  |
| CREStereo (accurate)   | 500 ms  | 2130 ms | 8900 ms  |
| Hitnet (accurate)      | 620 ms  | 2240 ms | 6960 ms  |
| RAFT-Stereo (accurate) | 1720 ms | 6800 ms | 22200 ms |

Finally here are the results with CPU inference and just 1 core.

CPU Inference (1 core, i9-9900K @ 3.6 GHz)

| Method                 | 320x240 | 640x480 | 1280x720 |
|------------------------|---------|---------|----------|
| OpenCV-BM              | 4 ms    | 22 ms   | 70 ms    |
| OpenCV-SGBM            | 12 ms   | 70 ms   | 230 ms   |
| RealtimeStereo         | 27 ms   | 130 ms  | 420 ms   |
| RAFT-Stereo (fast)     | 680 ms  | 2600 ms | 8300 ms  |
| Hitnet (fast)          | 320 ms  | 1370 ms | 4180 ms  |
| CREStereo (fast)       | 980 ms  | 4240 ms | 13450 ms |
| Hitnet (accurate)      | 1.1 s   | 4.1 s   | 12.2 s   |
| CREStereo (accurate)   | 1.9 s   | 7.8 s   | 24 s     |
| RAFT-Stereo (accurate) | 8.5 s   | 32.5 s  | 102.5 s  |
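For reference, restricting inference to a single core mostly comes down to limiting intra-op parallelism. Here is a sketch of how this can be done for the two runtimes involved; the measurements above may have used a different mechanism (e.g. taskset).

```python
import torch

# Limit PyTorch CPU inference (RAFT-Stereo, RealtimeStereo) to a single thread.
torch.set_num_threads(1)

# For the ONNX models (CREStereo, Hitnet), onnxruntime exposes the equivalent:
#   opts = onnxruntime.SessionOptions()
#   opts.intra_op_num_threads = 1
#   session = onnxruntime.InferenceSession("model.onnx", sess_options=opts)
```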

Conclusion

The recent deep learning approaches are very impressive on hard scenes compared to traditional methods. The drawback is that they sometimes create large areas of good-looking but inaccurate geometry (e.g. warped walls), while block matching methods return sparser and noisier depths, but won’t hallucinate wrong geometry.

RAFT-Stereo appears to be a solid choice in terms of the speed / memory / accuracy tradeoff with its fast settings, and it generalizes well. CREStereo and Hitnet are also competitive, with CREStereo often giving the nicest results in its accurate settings. Its model was also trained on more public datasets, which might help. These methods are also quite easy to tune to reach a different speed tradeoff, and adding some room/indoor datasets during training would likely improve their accuracy.

RealtimeStereo is very fast, but the KITTI-trained model does not generalize well to indoor scenes.

Want to try these yourself, on pre-captured images or directly from an OAK-D camera? Just pip install stereodemo (https://github.com/nburrus/stereodemo).