Comparing some recent stereo algorithms in the wild
Tons of progress has been made in estimating depth from a pair of stereo images. Current state-of-the-art methods all rely on deep learning and reach impressive accuracy on public benchmarks, even in hard textureless areas. But how do they actually perform in practice on stereo images captured in the wild? I am especially interested in indoor scenes (room scanning, object scanning), while the most frequently used large stereo benchmark is KITTI, which focuses on outdoor autonomous driving. So let's see how well these methods generalize and what the performance/memory tradeoffs are for some of them.
Included methods
- OpenCV StereoBM and StereoSGBM, using reasonable parameters. These "traditional" non-deep-learning methods are based on block matching and are meant to act as a baseline. Their memory footprint is tiny and they are very fast, but they struggle with textureless areas and tend to produce sparse, noisy output. The OpenCV 4.5.5 implementation is used.
- CREStereo (CVPR 2022): "Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation". Their objective is to provide a robust algorithm that can handle approximate calibrations and consumer-grade cameras. They also aim for memory efficiency by using a local search window instead of a full cost volume. They share a model pretrained on SceneFlow, Sintel, Middlebury, ETH3D, KITTI, Falling Things, InStereo2K and HR-VS. We use two variants converted to ONNX by PINTO0309:
  - CREStereo-Accurate: 5 iterations, a more balanced speed/accuracy tradeoff.
  - CREStereo-Fast: 2 iterations, aiming for speed.
- RAFT-Stereo (3DV 2021): “Multilevel Recurrent Field Transforms for Stereo Matching”. Inspired by the success of RAFT for optical flow estimation, the authors adapted it to stereo. It is an iterative method that keeps refining the depth with multi-scale gated recurrent units. It is implemented using only 2D convolutions and a lighter 3D cost volume, and they propose a real-time setting. Two pre-trained variants are included (refer to the paper for details):
  - RAFT-Accurate: high accuracy settings, trained on SceneFlow + Middlebury.
  - RAFT-Fast: real-time settings, trained on SceneFlow.
- Hitnet (CVPR 2021): "Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching". They also focus on performance, avoiding a full cost volume computation and adopting a coarse-to-fine propagation with small planar tiles. Here too, two pre-trained variants are considered (converted to ONNX by PINTO0309):
  - Hitnet-Accurate: high accuracy settings, trained on SceneFlow.
  - Hitnet-Fast: fast settings, trained on KITTI + Middlebury + ETH3D images.
- RealtimeStereo (ACCV 2020): "Attention-Aware Feature Aggregation for Real-time Stereo Matching on Edge Devices". This paper targets real-time performance on embedded hardware like the NVidia Jetson TX2. The disparity is first computed at a very coarse resolution and progressively upsampled. Its accuracy on in-the-wild datasets is not as good as the other methods', but I included it since it's very fast and has a low memory footprint. The model was pre-trained by the authors on SceneFlow + KITTI.
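To make the block-matching baseline concrete, here is a toy numpy sketch of the core idea behind StereoBM: for each pixel, compare a small block against candidate blocks along the same row of the right image over a disparity search range, using the sum of absolute differences (SAD). This is a naive illustration I wrote for this post, not OpenCV's heavily optimized implementation.

```python
import numpy as np

def block_match(left, right, max_disp=16, block=5):
    """Naive SAD block matching on rectified grayscale images.

    For each left pixel, slide a block along the same row of the right
    image and keep the disparity with the lowest sum of absolute
    differences. Real implementations add subpixel refinement,
    uniqueness checks and speckle filtering on top of this.
    """
    h, w = left.shape
    r = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.int32)
            costs = [
                np.abs(ref - right[y - r:y + r + 1,
                                   x - d - r:x - d + r + 1].astype(np.int32)).sum()
                for d in range(max_disp)
            ]
            disp[y, x] = int(np.argmin(costs))
    return disp
```

The failure mode on textureless areas is easy to see here: on a blank wall every candidate block looks the same, so the argmin is essentially arbitrary, which is why these methods return sparse and noisy depth there.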
Qualitative results
For a qualitative comparison I’ve captured images with my OAK-D Lite stereo camera, using the factory calibration and the rectification computed on the device itself. I configured it to output VGA images (640x480).
The visualization is done via stereodemo, a small utility I developed to compare stereo algorithms. It's easy to try these methods yourself on new data with `pip install stereodemo`.
Here is a video of a hands + desk scene:
Living room scene:
Bedroom scene:
I would argue that CREStereo and RAFT-Stereo consistently give the best results, with a small advantage to RAFT-Stereo when using the fast settings. Hitnet is pretty good too, but it would probably need to be trained on more indoor scenes to stop warping blank walls. The RealtimeStereo of Chang et al. does not generalize very well outside of the KITTI dataset where it was initially trained and evaluated.
Performance
Peak memory usage
Let's first look at the peak memory usage, ordered from lowest to highest. For OpenCV-BM, SGBM, CREStereo and Hitnet I estimate it roughly by looking at the peak memory usage during CPU inference with `/usr/bin/time -v` and subtracting the memory usage when inference is not called. For RAFT-Stereo and RealtimeStereo the peak is measured for GPU inference with `torch.cuda.max_memory_allocated()`.
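As an illustration of this subtract-a-baseline approach, here is a minimal in-process sketch using Python's `resource` module (Unix only) to read the peak resident set size. This is my own alternative to running `/usr/bin/time -v` externally; the 64 MB numpy buffer is just a stand-in for the inference workload.

```python
import resource

import numpy as np

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

baseline = peak_rss_mb()
# Stand-in workload: allocate and touch ~64 MB, as inference would.
buf = np.ones((64, 1024, 1024), dtype=np.uint8)
inference_mb = peak_rss_mb() - baseline
```

For GPU inference this approach doesn't help since the allocations live in device memory, hence the use of `torch.cuda.max_memory_allocated()` for the PyTorch models.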
**Peak Memory Usage (MB)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | 3 | 5 | 11 |
| OpenCV-SGBM | 6 | 6 | 13 |
| RealtimeStereo | 5 | 18 | 56 |
| RAFT-Stereo (fast) | 114 | 172 | 450 |
| CREStereo (all) | 126 | 458 | 1309 |
| RAFT-Stereo (accurate) | 179 | 530 | 1512 |
| Hitnet (fast) | 182 | 856 | 2516 |
| Hitnet (accurate) | 785 | 2179 | 6973 |
Overall the traditional methods have a low footprint (note that OpenCV default semi-global matching uses a memory-efficient variant). RealtimeStereo is also memory efficient by computing a cost volume only at a very coarse resolution and then only performing small corrections with a tiny disparity range search. RAFT-Stereo downsamples the input more aggressively in its fast settings, leading to a reasonable footprint. CREStereo has no memory usage difference between the fast and accurate settings, and Hitnet tends to be quite memory hungry.
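To see why avoiding a full cost volume matters so much, here is a back-of-the-envelope calculation. The 128-disparity search range and float32 storage are illustrative assumptions, not values measured from any of these models.

```python
# A dense cost volume stores one matching cost per pixel and per
# candidate disparity. At VGA resolution:
h, w = 480, 640
max_disp = 128
bytes_per_entry = 4  # float32

volume_mb = h * w * max_disp * bytes_per_entry / 2**20
print(f"{volume_mb:.0f} MB")  # 150 MB
```

And that is for a single-channel cost; volumes built over multi-channel feature maps multiply this further. Searching a small local window (CREStereo) or building the volume at a coarse resolution (RealtimeStereo) shrinks this term dramatically.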
Inference speed
First, let's look at the inference performance on a top-end gaming GPU (NVidia RTX 3090), ranked by speed at VGA resolution. For RAFT-Stereo and RealtimeStereo the authors' original PyTorch implementation was used. For CREStereo and Hitnet the ONNX models converted by PINTO0309 were used with `onnxruntime`.
Note that these results should be taken with a grain of salt as the ONNX conversion might not be optimal, and some methods have optimizations that were not included. In particular RAFT-Stereo has a faster correlation sampler (code available), and Hitnet claims to be up to 3x faster with custom CUDA operations (code not available).
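The timings below follow the usual recipe: a few warm-up runs, then an average over several iterations. This harness is my own sketch, not the exact code used for the tables.

```python
import time

def benchmark_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds.

    For GPU inference, fn must block until the result is ready
    (e.g. call torch.cuda.synchronize() or copy the output back to
    the CPU), otherwise you only measure the kernel launch.
    """
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs * 1000.0
```

For the ONNX models this would wrap something like `benchmark_ms(lambda: session.run(None, inputs))` on an `onnxruntime.InferenceSession`.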
**GPU Inference (RTX 3090)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | N/A | N/A | N/A |
| OpenCV-SGBM | N/A | N/A | N/A |
| RealtimeStereo | 7 ms | 8 ms | 15 ms |
| RAFT-Stereo (fast) | 30 ms | 35 ms | 60 ms |
| Hitnet (fast) | 14 ms | 40 ms | 80 ms |
| CREStereo (fast) | 21 ms | 56 ms | 175 ms |
| CREStereo (accurate) | 30 ms | 90 ms | 280 ms |
| Hitnet (accurate) | 34 ms | 100 ms | 280 ms |
| RAFT-Stereo (accurate) | 110 ms | 190 ms | 430 ms |
All the methods are pretty fast on a beefy GPU, especially in their fast settings. As expected CPU inference is much slower, even with 8 cores.
**CPU Inference (8 cores, i9-9900K @ 3.6 GHz)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | 2.5 ms | 9 ms | 20 ms |
| OpenCV-SGBM (1 core) | 12 ms | 70 ms | 230 ms |
| RealtimeStereo | 15 ms | 70 ms | 180 ms |
| RAFT-Stereo (fast) | 157 ms | 550 ms | 1800 ms |
| Hitnet (fast) | 160 ms | 720 ms | 2340 ms |
| CREStereo (fast) | 300 ms | 1440 ms | 5580 ms |
| CREStereo (accurate) | 500 ms | 2130 ms | 8900 ms |
| Hitnet (accurate) | 620 ms | 2240 ms | 6960 ms |
| RAFT-Stereo (accurate) | 1720 ms | 6800 ms | 22200 ms |
Finally here are the results with CPU inference and just 1 core.
**CPU Inference (1 core, i9-9900K @ 3.6 GHz)**

| | 320x240 | 640x480 | 1280x720 |
|---|---|---|---|
| OpenCV-BM | 4 ms | 22 ms | 70 ms |
| OpenCV-SGBM | 12 ms | 70 ms | 230 ms |
| RealtimeStereo | 27 ms | 130 ms | 420 ms |
| RAFT-Stereo (fast) | 680 ms | 2600 ms | 8300 ms |
| Hitnet (fast) | 320 ms | 1370 ms | 4180 ms |
| CREStereo (fast) | 980 ms | 4240 ms | 13450 ms |
| Hitnet (accurate) | 1.1 s | 4.1 s | 12.2 s |
| CREStereo (accurate) | 1.9 s | 7.8 s | 24 s |
| RAFT-Stereo (accurate) | 8.5 s | 32.5 s | 102.5 s |
Conclusion
The recent deep learning approaches are very impressive on hard scenes compared to traditional methods. The drawback is that they sometimes create large areas of good looking but inaccurate geometry (e.g. warped walls), while block matching methods return sparser and noisier depths, but won’t hallucinate wrong geometry.
RAFT-Stereo appears to be a solid choice in terms of the speed / memory / accuracy tradeoff with its fast settings, and it generalizes well. CREStereo and Hitnet are also competitive, with CREStereo often giving the nicest results in its accurate settings. Its model was also trained on more public datasets, which might help. These methods are also quite easy to tune for a different speed tradeoff, and adding some room/indoor datasets during training would likely improve their accuracy.
RealtimeStereo is very fast, but the KITTI-trained model does not generalize well to indoor scenes.
Want to try these yourself, on pre-captured images or directly from an OAK-D camera? Just `pip install stereodemo` (https://github.com/nburrus/stereodemo).