Abstract
We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input.
To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion.
Our representation is optimized through a neural network to fit the observed input views.
We show that our representation can be used for a variety of in-the-wild dynamic scenes, including those with thin structures, view-dependent effects, and complex motion.
We conduct a number of experiments demonstrating that our approach significantly outperforms recent monocular view synthesis methods, and we show qualitative results of space-time view synthesis on a variety of real-world videos.
Introduction
The topic of novel view synthesis has recently seen impressive progress due to the use of neural networks to learn representations that are well suited for view synthesis tasks.
Most prior approaches in this domain make the assumption that the scene is static, or that it is observed from multiple synchronized input views.
However, these restrictions are violated by most videos shared on the Internet today, which frequently feature scenes with diverse dynamic content (e.g., humans, animals, vehicles), recorded by a single camera.
We present a new approach for novel view and time synthesis of dynamic scenes from monocular video input with known (or derivable) camera poses.
This problem is highly ill-posed since there can be multiple scene configurations that lead to the same observed image sequences.
In addition, using multi-view constraints for moving objects is challenging, as doing so requires knowing the dense 3D motion of all scene points (i.e., the “scene flow”).
In this work, we propose to represent a dynamic scene as a continuous function of both space and time, whose output consists not only of reflectance and density, but also of 3D scene motion.
Similar to prior work, we parameterize this function with a deep neural network (a multi-layer perceptron, MLP), and perform rendering using volume tracing.
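As a concrete illustration, below is a minimal PyTorch-style sketch of such a time-variant function: an MLP that maps a 3D position and a time value to color, density, and forward/backward 3D scene flow. The layer sizes, the omission of positional encoding and view-direction inputs, and all names are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class NeuralSceneFlowField(nn.Module):
    """Sketch of a time-variant scene function (not the exact architecture).

    Maps a 3D point x and a time t to RGB color, volume density, and
    forward/backward 3D scene flow.
    """
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.rgb = nn.Linear(hidden, 3)    # reflectance / color
        self.sigma = nn.Linear(hidden, 1)  # volume density
        self.flow = nn.Linear(hidden, 6)   # forward and backward 3D scene flow

    def forward(self, x, t):
        # x: [..., 3] positions, t: [..., 1] times
        h = self.trunk(torch.cat([x, t], dim=-1))
        rgb = torch.sigmoid(self.rgb(h))
        sigma = torch.relu(self.sigma(h))
        flow_fw, flow_bw = self.flow(h).chunk(2, dim=-1)
        return rgb, sigma, flow_fw, flow_bw
```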
We optimize the weights of this MLP using a scene flow fields warping loss that enforces that our scene representation is temporally consistent with the input views.
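The sketch below illustrates the idea behind such a warping loss, under the same assumed interface as the model sketch above: points sampled along a ray cast from the frame at time t are displaced by the predicted forward scene flow, the field is queried at a neighboring time at the displaced locations, and the result is volume rendered and compared against the observed pixel colors of the frame at time t. It omits disocclusion handling and the other terms of the full objective.

```python
import torch

def volume_render(rgb, sigma, z_vals):
    # Standard volume rendering along each ray (simplified: no ray-direction
    # dependence). rgb: [R, S, 3], sigma: [R, S, 1], z_vals: [R, S] sample
    # depths; returns per-ray colors [R, 3].
    dists = z_vals[..., 1:] - z_vals[..., :-1]
    dists = torch.cat([dists, torch.full_like(dists[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * dists)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1,
    )[..., :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=-2)

def warping_loss(model, pts, z_vals, t, t_next, target_rgb):
    # pts: [R, S, 3] points sampled along R rays cast from the camera of the
    # frame at time t; target_rgb: [R, 3] observed pixel colors of that frame.
    time = torch.full_like(pts[..., :1], t)
    _, _, flow_fw, _ = model(pts, time)               # scene flow from t to t_next
    warped = pts + flow_fw
    rgb_w, sigma_w, _, _ = model(warped, torch.full_like(time, t_next))
    rendered = volume_render(rgb_w, sigma_w, z_vals)  # frame t reconstructed from
    return torch.mean((rendered - target_rgb) ** 2)   # the field at time t_next
```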
Crucially, as we model dense scene flow fields in 3D, our function can represent the sharp motion discontinuities that arise when projecting the scene into image space, even with simple low-level 3D smoothness priors.
Further, dense scene flow fields also enable us to interpolate along changes in both space and time.
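For instance, low-level priors of this kind can be written as simple regularizers on the predicted flow. The sketch below (reusing the hypothetical model interface from above) shows two such terms, forward/backward cycle consistency and a small-motion preference; the exact terms and weights are illustrative, and spatial and temporal smoothness terms are omitted.

```python
import torch

def scene_flow_priors(model, pts, t, dt=1.0, w_min=0.1):
    # Cycle consistency: following the forward flow from time t and then the
    # backward flow at time t + dt should return (approximately) to the start.
    time = torch.full_like(pts[..., :1], t)
    _, _, flow_fw, _ = model(pts, time)
    _, _, _, flow_bw = model(pts + flow_fw, torch.full_like(time, t + dt))
    cycle = (flow_fw + flow_bw).abs().mean()
    # Small-motion prior: prefer scene flow with small magnitude.
    minimal = flow_fw.abs().mean()
    return cycle + w_min * minimal
```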
To the best of our knowledge, our approach is the first to achieve novel view and time synthesis of dynamic scenes captured from a monocular camera.
As the problem is very challenging, we introduce different components that improve rendering quality over a baseline solution.
Specifically, we analyze scene flow ambiguity at motion disocclusions and propose a solution to it.
We also show how to use data-driven priors to avoid local minima during optimization, and we describe how to effectively combine a static scene representation with a dynamic one, which lets us render higher-quality views by leveraging multi-view constraints in static regions.
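One simple way to realize such a combination (a sketch, not necessarily our exact formulation) is to blend the per-sample outputs of a static and a dynamic field with a blending weight, here assumed to be predicted by the dynamic model, before volume rendering:

```python
import torch

def blend_fields(rgb_s, sigma_s, rgb_d, sigma_d, blend):
    # blend in [0, 1]: 1 -> fully dynamic, 0 -> fully static. The blended
    # color and density can then be passed to standard volume rendering.
    sigma = blend * sigma_d + (1.0 - blend) * sigma_s
    rgb = (blend * sigma_d * rgb_d + (1.0 - blend) * sigma_s * rgb_s) / (sigma + 1e-8)
    return rgb, sigma
```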
In summary, our key contributions include:
(1) Neural Scene Flow Fields, a neural representation for space-time view synthesis of dynamic scenes with the capacity to model 3D scene dynamics, and
(2) a method for optimizing Neural Scene Flow Fields on monocular video by leveraging multi-view constraints in both rigid and non-rigid regions, allowing us to synthesize and interpolate in both view and time simultaneously.