Comparing some recent stereo algorithms in the wild
Tons of progress has been made to estimate depth from a pair of stereo images. The current state-of-the-art methods all rely on deep learning to reach impressive accuracy levels on public benchmarks, even on hard textureless areas. But how do they actually perform in practice on stereo images captured in the wild? I am especially interested in indoor scenes (room scanning, object scanning), while the most frequently used large stereo benchmark is KITTI, which is focused on outdoor autonomous driving. So let’s see how well they generalize and what the performance/memory tradeoffs are for some of them.