Disentangling Content and Motion for Text-Based Neural Video Manipulation

1Iskenderun Technical University, 2Bogazici University, 3Imperial College London, 4Hacettepe University, 5Koc University

This page contains qualitative results accompanying our paper. Each set of qualitative results can be accessed via the links below, or you can simply scroll down to watch the videos in order. Specific sequences are linked with buttons for easy reference.

  1. Video Translation on 3D Shapes
  2. Video Translation on Fashion Videos
  3. Disentanglement Quality on 3D Shapes
  4. Continuous Spatiotemporal Sampling via Neural ODE
  5. Fashion Videos Dataset Overview

Abstract

Giving machines the ability to imagine possible new objects or scenes from linguistic descriptions and produce their realistic renderings is arguably one of the most challenging problems in computer vision. Recent advances in deep generative models have led to new approaches that give promising results towards this goal. In this paper, we introduce a new method called DiCoMoGAN for manipulating videos with natural language, aiming to perform local and semantic edits on a video clip to alter the appearances of an object of interest. Our GAN architecture allows for better utilization of multiple observations by disentangling content and motion to enable controllable semantic edits. To this end, we introduce two tightly coupled networks: (i) a representation network for constructing a concise understanding of motion dynamics and temporally invariant content, and (ii) a translation network that exploits the extracted latent content representation to actuate the manipulation according to the target description. Our qualitative and quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms existing frame-based methods, producing temporally coherent and semantically more meaningful results.
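To make the two-network design described above more concrete, here is a minimal, hypothetical PyTorch sketch of the idea: a representation network that maps a frame to separate content and motion codes, and a translation network that edits the frame conditioned on the content code and a target text embedding. All module names, layer sizes, and the fusion scheme are illustrative assumptions, not the authors' actual architecture.

    import torch
    import torch.nn as nn


    class RepresentationNet(nn.Module):
        """Encodes a frame into a temporally invariant 'content' code and a 'motion' code."""

        def __init__(self, content_dim=16, motion_dim=8):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.to_content = nn.Linear(64, content_dim)
            self.to_motion = nn.Linear(64, motion_dim)

        def forward(self, frame):
            h = self.backbone(frame)
            return self.to_content(h), self.to_motion(h)


    class TranslationNet(nn.Module):
        """Edits a frame conditioned on its content code and a target text embedding."""

        def __init__(self, content_dim=16, text_dim=32):
            super().__init__()
            self.fuse = nn.Linear(content_dim + text_dim, 64)
            self.decode = nn.Sequential(
                nn.Conv2d(64 + 3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
            )

        def forward(self, frame, content, text_emb):
            cond = self.fuse(torch.cat([content, text_emb], dim=-1))
            cond = cond[:, :, None, None].expand(-1, -1, *frame.shape[-2:])
            return self.decode(torch.cat([cond, frame], dim=1))


    frames = torch.randn(4, 3, 64, 64)         # a short clip of 4 frames
    text_emb = torch.randn(4, 32)              # embedding of the target description
    rep, trans = RepresentationNet(), TranslationNet()
    content, motion = rep(frames)              # per-frame content and motion codes
    edited = trans(frames, content, text_emb)  # text-conditioned edit of every frame
    print(edited.shape)                        # torch.Size([4, 3, 64, 64])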

Video

1. Video Translation on 3D Shapes

Our goal is to perform seamless and semantically meaningful edits on each video frame. In doing so, we need to keep the identity, motion dynamics, and description-irrelevant regions intact.


Input Video
"a medium pink sphere"
"a medium green sphere"
"a medium blue sphere"
"a medium red sphere"

Input Video
"a big pink sphere"
"a big green sphere"
"a big blue sphere"
"a big red sphere"

Input Video
"a medium pink capsule"
"a big green cube"
"a small orange sphere"
"a big red cylinder"

2. Video Translation on Fashion Videos

Our goal is to perform seamless and semantically meaningful edits on each video frame. In doing so, we need to keep the identity, motion dynamics, and description-irrelevant regions intact.


Input Video
"Red dress with short sleeves"
"Purple dress with short sleeves"
"Green dress with short sleeves"
"Yellow dress with short sleeves"

Input Video
"Red dress with long sleeves"
"Purple dress with long sleeves"
"Green dress with long sleeves"
"Yellow dress with long sleeves"

Input Video
"Red shirts"
"Purple jumpsuits"
"Green shorts"
"Yellow t-shirts"

3. Disentanglement Quality on 3D Shapes

DiCoMoGAN learns latent variables that capture highly interpretable concepts, decomposed into text-relevant, text-irrelevant static, and dynamic features. Note that wall and floor colors are not mentioned in the descriptions during training. (A minimal latent-traversal sketch is given after the panels below.)


Input


Reconstruction


Object Color


Object Shape


Object Scale


Wall Color


Floor Color


Dynamic
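The panels above are produced by traversing individual latent dimensions. As a rough illustration of how such a traversal works, the following hypothetical sketch sweeps a single latent dimension of a stand-in decoder while holding the others fixed; the decoder, the latent layout, and the choice of dimension are assumptions for illustration only, not the paper's code.

    import torch
    import torch.nn as nn

    latent_dim = 12
    decoder = nn.Sequential(                  # stand-in for a trained frame decoder
        nn.Linear(latent_dim, 64), nn.ReLU(),
        nn.Linear(64, 3 * 64 * 64), nn.Tanh(),
    )

    z = torch.zeros(1, latent_dim)            # latent code of one input frame
    dim_to_sweep = 3                          # hypothetical "object color" dimension

    frames = []
    for value in torch.linspace(-3.0, 3.0, steps=8):
        z_edit = z.clone()
        z_edit[0, dim_to_sweep] = value       # change only this latent factor
        frames.append(decoder(z_edit).view(3, 64, 64))

    traversal = torch.stack(frames)           # 8 decoded frames showing the sweep
    print(traversal.shape)                    # torch.Size([8, 3, 64, 64])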

4. Continuous Spatiotemporal Sampling via Neural ODE

A key advantage of DiCoMoGAN is its use of latent ODEs, which allows us to interpolate in-between frames over time.


Input Video Frames

Spatiotemporal Sampling by Neural ODE

Here, thanks to the latent ODE, we interpolate 256 frames between the first (t=0.0) and last (t=1.0) frames of the input video.
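As a rough illustration of this continuous sampling, the sketch below integrates a small latent dynamics function from t=0.0 to t=1.0 at 256 time points using torchdiffeq's odeint and decodes each intermediate latent state into a frame. The dynamics function, the stand-in decoder, and the use of torchdiffeq are assumptions for illustration, not the paper's released code.

    import torch
    import torch.nn as nn
    from torchdiffeq import odeint  # pip install torchdiffeq


    class LatentDynamics(nn.Module):
        """dz/dt = f(z): a small MLP defining the latent dynamics."""

        def __init__(self, dim=16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

        def forward(self, t, z):
            return self.net(z)


    dynamics = LatentDynamics(dim=16)
    decoder = nn.Sequential(nn.Linear(16, 3 * 64 * 64), nn.Tanh())  # stand-in decoder

    z0 = torch.randn(1, 16)                  # latent code of the first frame (t=0.0)
    t = torch.linspace(0.0, 1.0, steps=256)  # 256 evenly spaced time points

    # Integrate the ODE; zs has shape (256, 1, 16): one latent state per time point.
    zs = odeint(dynamics, z0, t)

    # Decode every intermediate latent state into a frame.
    frames = decoder(zs).view(256, 3, 64, 64)
    print(frames.shape)                      # torch.Size([256, 3, 64, 64])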


5. Fashion Videos Dataset Overview

We collected the Fashion Videos dataset from raw videos on the website of an online clothing retailer by searching for products in the cardigans, dresses, jackets, jeans, jumpsuits, shorts, skirts, tops, and trousers categories. The dataset contains 3178 video clips (approximately 109K distinct frames), which we split into 2579 clips for training and 598 for testing.

Please do not hesitate to send us an e-mail to access the Fashion Videos dataset.



BibTeX


    @inproceedings{Karacan_2022_BMVC,
      author    = {Levent Karacan and Tolga Kerimoğlu and İsmail Ata İnan and Tolga Birdal and Erkut Erdem and Aykut Erdem},
      title     = {Disentangling Content and Motion for Text-Based Neural Video Manipulation},
      booktitle = {British Machine Vision Conference (BMVC)},
      year      = {2022}
    }