This webpage highlights the performance of our method through visual comparisons and illustrative examples.
Our approach demonstrates the ability to semantically track objects across a scene. To visualize this, we subsample 100K Gaussians from the original 400K used per video and replace each Gaussian's scaling matrix with an isotropic scale, so that every Gaussian renders as a sphere of radius 1 whose motion can be followed over time. This subsampling may reduce the visibility of some shapes, but it provides a clearer view of individual Gaussian behavior.
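The subsampling and sphere conversion described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the array names, shapes, and value ranges are hypothetical stand-ins for the per-video Gaussian parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-video Gaussian parameters: 400K Gaussians with
# 3D positions and per-axis (anisotropic) scales. Illustrative only.
n_total, n_keep = 400_000, 100_000
positions = rng.normal(size=(n_total, 3)).astype(np.float32)
scales = rng.uniform(0.01, 0.5, size=(n_total, 3)).astype(np.float32)

# Uniformly subsample 100K of the 400K Gaussians without replacement.
idx = rng.choice(n_total, size=n_keep, replace=False)
sub_positions = positions[idx]

# Replace each anisotropic scale with an isotropic one so every
# subsampled Gaussian renders as a sphere of a fixed radius.
radius = 1.0
sub_scales = np.full((n_keep, 3), radius, dtype=np.float32)
```

Rendering `sub_positions` with `sub_scales` at each timestep then traces the motion of individual Gaussians across the video.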
The Gaussians remain effectively stationary when representing static objects and exhibit the expected semantic motions, such as buildings moving along their corresponding paths. In the campfire video, the Gaussians with seemingly random colors, primarily in the sky, are small-radius Gaussians that contribute to fine details. This behavior is a consequence of the high number of Gaussians and highlights our method's capacity for detailed, semantically meaningful motion representation.
Our method's inherent flexibility enables spatial resampling by modifying camera parameters such as scale, focal length, and the principal point. Below, we demonstrate adjusting the rendering resolution while preserving sharpness and structural detail, even under significant transformations.
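For a pinhole camera, rendering at a new resolution only requires rescaling the intrinsics: focal lengths and the principal point scale linearly with image size. A minimal sketch, with an assumed example intrinsics matrix:

```python
import numpy as np

def rescale_intrinsics(K, sx, sy):
    """Scale a 3x3 pinhole intrinsics matrix for a new render resolution.

    sx and sy are the resolution scale factors along x and y; they
    multiply the focal lengths (fx, fy) and principal point (cx, cy).
    """
    S = np.diag([sx, sy, 1.0])
    return S @ K

# Hypothetical 1920x1080 intrinsics, upsampled 2x along each axis.
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
K2 = rescale_intrinsics(K, 2.0, 2.0)
```

Rendering with `K2` at 3840x2160 then resamples the same scene content at the higher resolution.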
Our stylization approach edits the first frame with an off-the-shelf image-editing model and then trains with a reconstruction loss against the edited first frame. As a result, the stylization quality depends heavily on the quality of that edited first frame.
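The training objective above can be sketched as a reconstruction loop against the edited first frame. This toy version is purely illustrative: the learnable representation is a pixel grid and the "renderer" is the identity map, standing in for rendering frame 0 from the Gaussian scene; the edited frame is random data in place of an actual editing-model output.

```python
import numpy as np

rng = np.random.default_rng(0)
edited_frame = rng.uniform(size=(8, 8, 3))  # stand-in for the edited first frame
params = np.zeros_like(edited_frame)        # learnable representation

lr = 0.4
for _ in range(50):
    pred = params                           # "render" the first frame (identity stand-in)
    grad = 2.0 * (pred - edited_frame)      # gradient of the MSE reconstruction loss
    params -= lr * grad                     # gradient-descent update

mse = float(np.mean((params - edited_frame) ** 2))
```

In the real setting only the first frame is supervised this way; the rest of the video inherits the style through the shared scene representation, which is why artifacts in the edited frame propagate to the whole result.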