GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting


*Done during an internship at Adobe Research

¹Koç University  ²Adobe Research  ³Hacettepe University
Teaser Figure
We present GaussianVideo, a new Gaussian Splatting framework for video representation that effectively models in-the-wild videos while maintaining training efficiency and capturing semantic motion with minimal supervision. (a) We can render this 960x540 video at 93 FPS on an NVIDIA A40 GPU. (b) Our reconstruction of this video reaches 44.21 dB PSNR, compared to 29.36 dB for NeRV, a 50.6% improvement. (c) On the DAVIS dataset, our approach balances reconstruction quality against training time (dot size in log scale).

Method Figure
Overview of the GaussianVideo approach for neural video representation. Our method combines 3D Gaussian splatting with continuous camera motion modeling via Neural ODEs to handle dynamic scenes efficiently. The pipeline includes hierarchical learning strategies for both (a) spatial and (b) temporal domains, progressively refining Gaussians to capture fine details and smooth motion.
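As a rough, illustrative sketch of the continuous camera motion component, the snippet below integrates a small learned ODE over a camera state using torchdiffeq. The module, the 9-D state layout, and all variable names are our own assumptions for illustration, not the actual GaussianVideo implementation.

```python
# Illustrative sketch (not the released implementation): the camera state evolves
# continuously in time as the solution of a learned ODE, so any timestamp can be
# queried. The state layout (3-D translation + 6-D rotation) is an assumption.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq


class CameraODEFunc(nn.Module):
    def __init__(self, state_dim: int = 9, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, t, state):
        # Condition the derivative on time so the motion can speed up or slow down.
        t_feat = t.reshape(1, 1).expand(state.shape[0], 1)
        return self.net(torch.cat([state, t_feat], dim=-1))


ode_func = CameraODEFunc()
init_state = torch.zeros(1, 9, requires_grad=True)   # learnable initial camera state
frame_times = torch.linspace(0.0, 1.0, steps=50)     # normalized frame timestamps
# One continuous trajectory yields a camera state for every frame; intermediate
# times (used later for frame interpolation) come from the same call.
camera_states = odeint(ode_func, init_state, frame_times)  # (50, 1, 9)
```

During training, the state obtained at each frame time would parameterize that frame's camera, with gradients flowing back through the ODE solve.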

Overview

This webpage highlights the performance of our method through visual comparisons and illustrative examples.

  • Visual Comparisons: Compare our method with GaussianImage, NeRV, and HNeRV across various videos. Use the sliders to view side-by-side renderings and observe the differences in detail and quality.
  • Emergent Semantic Tracking: Our model naturally tracks dynamic elements in videos without additional supervision. Each example shows the full reconstruction using our method (left) and a version rendered with only 100K Gaussians, highlighting how Gaussians move in dynamic and static parts of the scene.
  • Frame Interpolation: We demonstrate the power of continuous motion representation in GaussianVideo through frame interpolation. The model can interpolate frames at arbitrary timesteps, even those not seen during training. Example videos show the case where we double the original framerate by interpolating between existing frames.
  • Spatial Resampling: Because the video is stored in an explicit spatial representation, we can resample it at arbitrary resolutions and camera viewports.
  • Video Stylization: Our approach applies style transfer across entire video sequences. Starting with an edited first frame, the model propagates the style consistently through the video, maintaining structural details as well as temporal coherence.

Ours (Left) vs NeRV (Right) Comparison

Comparison between our method (left) and NeRV (right) on a video from the DL3DV dataset. Our approach demonstrates superior temporal consistency and preserves sharper edges with finer details. Notably, the trees on the left and right remain crisp and are rendered cleanly in our reconstruction, while they appear blurry in the NeRV output.

Comparison of our method and NeRV on another video from the DL3DV dataset. While NeRV struggles with consistency and fails to reconstruct fine details—such as the wires on the rocks and the intricate features of the elephant—our approach excels, delivering a reconstruction that captures these elements with remarkable clarity and precision.

Comparison of our method and NeRV on a video from the DAVIS dataset. NeRV exhibits significant inconsistencies, particularly evident in the lack of detail on the man's face. In contrast, our method accurately preserves and reconstructs these fine details, demonstrating superior fidelity and consistency.

Comparison of our method and NeRV on another video from the DAVIS dataset. Much of the background in the NeRV video is blurry and inconsistent, and many of the details on the top of the ship are missing or barely visible. Our approach, meanwhile, provides a clear and detailed reconstruction of the whole scene.

Comparison of our method and NeRV on another video from the DAVIS dataset. As in the parkour video above, much of the person's detail is missing in the NeRV output, while our method reconstructs it properly.


Ours (Left) vs HNeRV (Right) Comparison

Comparison of our method and HNeRV on a video from the DAVIS dataset. Similar to NeRV, HNeRV suffers from consistency issues throughout the video, particularly noticeable around edges and in the details of the bushes and trees in the background. In contrast, our method maintains stable and coherent reconstructions.

Comparison of our method and HNeRV on another video from the DAVIS dataset. HNeRV struggles to render the rocks along the track clearly, resulting in a blurry and indistinct representation. In contrast, our method delivers a sharp and detailed reconstruction of the rocks, highlighting its superior ability to capture fine details.

Comparison of our method and HNeRV on another video from the DAVIS dataset. As with the previous examples, the HNeRV video appears blurrier in several regions and exhibits less overall consistency compared to our method.

Comparison of our method and HNeRV on another video from the DAVIS dataset. The texture of the pigs is rendered more clearly in our video, whereas the HNeRV reconstruction smooths over these details, resulting in a loss of texture and fine features.

Comparison of our method and HNeRV on another video from the DAVIS dataset. HNeRV is unable to fully reconstruct the fine textures of the fish, such as the scales, while our approach captures this high level of detail.


Ours (Left) vs GaussianImage (Right) Comparison

Comparison of our method and GaussianImage on a video from the DL3DV dataset. The sky in the GaussianImage video exhibits noticeable noise, whereas our method produces a smoother and more uniform sky, reflecting higher reconstruction quality.

Comparison of our method and GaussianImage on another video from the DAVIS dataset. The concrete surface beside the water shows noticeable noise in the GaussianImage video, which is absent from our reconstruction, highlighting the advantage of our approach.

Semantic Tracking

Our approach demonstrates the ability to semantically track objects across a scene. To visualize this, we subsample 100K Gaussians from the original 400K used per video, shrink each Gaussian's scaling to an isotropic sphere of radius 1, and render their motion over time. This subsampling may reduce the visibility of some shapes, but it provides a clearer view of individual Gaussian behavior.
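A minimal sketch of this visualization step, assuming the per-video Gaussian parameters are available as tensors (the variable names and shapes are illustrative):

```python
# Illustrative sketch: pick a fixed random subset of Gaussians and replace their
# anisotropic scales with a tiny isotropic one so each splat renders as a small
# sphere whose motion can be followed over time. Names/shapes are assumptions.
import torch

def subsample_for_tracking(positions, scales, colors, num_keep=100_000, radius=1.0):
    """positions: (N, 3), scales: (N, 3), colors: (N, 3) for one video."""
    n = positions.shape[0]                          # e.g. N = 400K in our setup
    idx = torch.randperm(n)[:num_keep]              # random 100K subset
    iso_scales = torch.full((num_keep, 3), radius)  # isotropic "radius 1" scale
    return positions[idx], iso_scales, colors[idx], idx

# The same `idx` is reused for every rendered frame so an identical subset of
# Gaussians is followed through time, which is what exposes the emergent tracking.
```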


The Gaussians remain effectively stationary when representing static objects and exhibit the expected semantic motions, such as the Gaussians covering buildings moving along their corresponding paths. In the campfire video, the seemingly randomly colored Gaussians, mostly in the sky, are small-radius Gaussians that contribute to fine details. This behavior is a consequence of using a large number of Gaussians and highlights our method's capacity for detailed, semantically meaningful motion representation.




Frame Interpolation

Frame interpolation results on a video from the DAVIS dataset. The number of frames is doubled while maintaining the original framerate by interpolating seamlessly between the existing frames, preserving smooth motion and temporal consistency.

Frame interpolation results on another video from the DAVIS dataset. The number of frames is doubled by interpolating between the existing frames, ensuring smooth transitions and maintaining the original framerate.
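Because the representation is continuous in time, interpolation reduces to rendering the model at timestamps between the training frames. A hedged sketch follows, with `TrainedGaussianVideo.render` standing in for the actual (unspecified) rendering interface:

```python
# Illustrative sketch: double the number of frames by rendering at midpoints
# between the original, normalized frame times. The render interface below is
# a placeholder, not the actual GaussianVideo API.
import torch


class TrainedGaussianVideo:                      # stand-in for a trained model
    def render(self, t: torch.Tensor) -> torch.Tensor:
        # The real pipeline would splat the Gaussians at time t into an image.
        return torch.zeros(540, 960, 3)


model = TrainedGaussianVideo()
num_frames = 50                                       # original frame count
orig_times = torch.linspace(0.0, 1.0, num_frames)     # times seen during training
mid_times = 0.5 * (orig_times[:-1] + orig_times[1:])  # unseen midpoints

all_times, _ = torch.sort(torch.cat([orig_times, mid_times]))
frames = [model.render(t) for t in all_times]         # roughly 2x as many frames
```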

Spatial Resampling

Our method's inherent flexibility enables spatial resampling by modifying parameters such as the scale, focal length, and principal point. Below, we demonstrate the capability to adjust resolution while preserving sharpness and structural details, even under significant transformations.
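A minimal sketch of how such a resampling could be expressed, assuming a standard pinhole intrinsics matrix; the `render` call at the end is a hypothetical placeholder, not the actual API:

```python
# Illustrative sketch: spatial resampling by editing the pinhole intrinsics.
# Halving the width and doubling the height only rescales the focal lengths
# and principal point; the Gaussians themselves are left untouched.
import numpy as np

def resample_intrinsics(K, sx, sy):
    """Scale a 3x3 intrinsics matrix by (sx, sy) along the x/y image axes."""
    S = np.diag([sx, sy, 1.0])
    return S @ K

W, H = 960, 540
K = np.array([[500.0,   0.0, W / 2],      # assumed focal length of 500 px
              [  0.0, 500.0, H / 2],
              [  0.0,   0.0,   1.0]])

K_new = resample_intrinsics(K, sx=0.5, sy=2.0)   # half width, double height
new_size = (int(W * 0.5), int(H * 2.0))          # (480, 1080)
# frame = render(gaussians, K_new, new_size)     # hypothetical render call
```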

Sample spatial resampling result. The video is spatially resampled by doubling the height while halving the width, effectively demonstrating the flexibility of our method in adjusting resolution without compromising visual quality.

Another sample spatial resampling result. The video is spatially resampled by doubling both the height and width, demonstrating the ability of our method to handle significant resolution adjustments while maintaining sharpness and detail.


Video Stylization

Our stylization approach edits the first frame with an off-the-shelf image editing model and then fine-tunes the representation with a reconstruction loss against that edited frame. Consequently, the stylization quality depends heavily on the quality of the edited first frame.
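A hedged sketch of this fine-tuning step, assuming the trained representation exposes learnable parameters and a differentiable per-frame render (all names and the choice of L1 loss are our own illustrative assumptions):

```python
# Illustrative sketch: fine-tune the trained video representation so its
# rendering of frame 0 matches the edited target frame; the learned motion then
# carries the style through the remaining frames. `model` and its interface
# are placeholders, not the actual GaussianVideo code.
import torch
import torch.nn.functional as F

def stylize(model, edited_frame0, steps=2000, lr=1e-3):
    """edited_frame0: (H, W, 3) tensor produced by an off-the-shelf editor."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    t0 = torch.tensor(0.0)                          # normalized time of frame 0
    for _ in range(steps):
        optimizer.zero_grad()
        rendered = model.render(t0)                 # differentiable render of frame 0
        loss = F.l1_loss(rendered, edited_frame0)   # reconstruction loss vs. the edit
        loss.backward()
        optimizer.step()
    return model
```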


A video stylization result. The original video is edited with a prompt that makes the water muddy.