This work addresses pose-free novel view synthesis from stereo pairs, a challenging problem in 3D vision. Our framework integrates 2D correspondence matching, camera pose estimation, and NeRF rendering so that each task reinforces the others. We achieve this with an architecture built on a shared representation, which serves as a foundation for enhanced 3D geometry understanding. Exploiting the inherent interplay between these tasks, the unified framework is trained end-to-end with our proposed training strategy to improve overall model accuracy. Extensive evaluations on diverse indoor and outdoor scenes from two real-world datasets show that our approach substantially outperforms previous methods, especially under extreme viewpoint changes and in the absence of accurate camera poses.

Qualitative Results

Qualitative comparisons on RealEstate10K Dataset.
Qualitative comparisons on ACID Dataset.
Visualization of epipolar lines from estimated poses.

Quantitative Results

The best-performing results are presented in bold; gray indicates methods that are not directly comparable and are included for reference only. We also specify the targeted task of each method. We evaluate our method on camera pose estimation and novel view synthesis.

Main Architecture

For a pair of images, we extract multi-level feature maps and construct a 4D correlation map at each level that encodes the similarity of every pixel pair across the two views. These maps are refined for flow and pose estimation, and the renderer then uses the estimated pose and the refined feature maps to compute color and depth.
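To make the 4D correlation construction concrete, here is a minimal sketch (not the paper's implementation) of how pairwise pixel similarities between two feature maps can be computed as a 4D volume; the function name `correlation_4d`, the NumPy backend, and the cosine-similarity choice are illustrative assumptions:

```python
import numpy as np

def correlation_4d(feat_a, feat_b, eps=1e-8):
    """Build a 4D correlation volume between two feature maps.

    feat_a, feat_b: (C, H, W) arrays from the two views.
    Returns an (H, W, H, W) array of cosine similarities, where entry
    [i, j, k, l] scores pixel (i, j) of view A against pixel (k, l) of view B.
    """
    # L2-normalize each pixel's feature vector so the dot product is a cosine.
    a = feat_a / (np.linalg.norm(feat_a, axis=0, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=0, keepdims=True) + eps)
    # Contract over the channel dimension to get all pairwise similarities.
    return np.einsum("chw,cij->hwij", a, b)

# Toy example: correlating a feature map with itself, so each pixel's
# strongest match is itself (cosine similarity ~1 on the "diagonal").
rng = np.random.default_rng(0)
f = rng.standard_normal((16, 8, 8))
corr = correlation_4d(f, f)
print(corr.shape)  # (8, 8, 8, 8)
```

In a multi-level setup, such a volume would be built once per feature level and then refined by subsequent layers for flow and pose estimation.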



The website template was borrowed from Michaël Gharbi.