CLIPInverter

CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing


Abstract

Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. A particularly interesting application is using natural language descriptions to guide the editing process. Existing language-based editing approaches either resort to instance-level latent code optimization or map predefined text prompts to fixed editing directions in the latent space. Both have inherent limitations: the former is inefficient, while the latter often struggles to handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that efficiently and reliably performs multi-attribute edits. The core of our method is a set of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that conditioning the initial inversion step on the CLIP embedding of the target description yields more successful edit directions. In addition, a CLIP-guided refinement step corrects the resulting residual latent codes, further improving alignment with the text prompt. Our method outperforms competing approaches in manipulation accuracy and photo-realism across various domains, including human faces, cats, and birds, as demonstrated by our qualitative and quantitative results.


Method Overview


An overview of our CLIPInverter approach in comparison to similar text-guided image manipulation methods. StyleCLIP-LM uses the target description only in the loss function. HairCLIP additionally uses the description to modulate the latent code produced by the encoder within its mapper. In contrast, our CLIPInverter employs specially designed adapter layers, CLIPAdapter, to modulate the encoder itself so that the latent code is extracted with respect to the target description. To obtain even more accurate edits, it also uses an additional refinement module, CLIPRemapper, which makes subsequent corrections to the predicted latent code.
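To make the roles of the two modules concrete, below is a minimal, self-contained PyTorch sketch: an adapter that modulates encoder features with the CLIP text embedding, and a refiner that corrects the predicted latent code. The module names, dimensions, and the FiLM-style scale/shift modulation are illustrative assumptions, not the paper's exact design.

# Illustrative sketch of a CLIPAdapter-like module and a CLIPRemapper-like module.
# Names, dimensions, and the modulation scheme are assumptions for exposition.
import torch
import torch.nn as nn

class TextConditionedAdapter(nn.Module):
    """Modulates an encoder feature map with a CLIP text embedding (FiLM-style)."""
    def __init__(self, feat_channels: int, clip_dim: int = 512):
        super().__init__()
        self.to_scale = nn.Linear(clip_dim, feat_channels)
        self.to_shift = nn.Linear(clip_dim, feat_channels)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W), text_emb: (B, clip_dim)
        scale = self.to_scale(text_emb).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(text_emb).unsqueeze(-1).unsqueeze(-1)
        return feats * (1 + scale) + shift

class LatentRefiner(nn.Module):
    """Predicts a text-guided correction on a W+ latent code (CLIPRemapper-like role)."""
    def __init__(self, latent_dim: int = 512, clip_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + clip_dim, latent_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, w_plus: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # w_plus: (B, n_styles, 512); broadcast the text embedding to every style vector.
        cond = text_emb.unsqueeze(1).expand(-1, w_plus.size(1), -1)
        return w_plus + self.mlp(torch.cat([w_plus, cond], dim=-1))

if __name__ == "__main__":
    adapter = TextConditionedAdapter(feat_channels=256)
    refiner = LatentRefiner()
    feats = torch.randn(2, 256, 16, 16)    # intermediate encoder features
    text_emb = torch.randn(2, 512)         # CLIP embedding of the target description
    w_plus = torch.randn(2, 18, 512)       # latent code predicted by the encoder
    print(adapter(feats, text_emb).shape)  # torch.Size([2, 256, 16, 16])
    print(refiner(w_plus, text_emb).shape) # torch.Size([2, 18, 512])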


CLIPInverter results:

  1. Text-Guided Manipulation
  2. Composition of Facial Attributes
  3. Continuous Manipulation
  4. Image-Guided Manipulation
  5. Manipulations with Unseen Captions
  6. Comparisons Against Other Approaches
  7. Ablation Study

1. Text-Guided Manipulation

Qualitative manipulation results. We show sample text-guided manipulation results on human faces (left), cat images (middle), and bird images (right). Our approach successfully makes local semantic edits based on the target descriptions while keeping the generated outputs faithful to the input images. The images displayed on the left side are the inversion results obtained with the e4e encoder.

2. Composition of Facial Attributes

Manipulations with compositions of facial attributes. We provide example manipulation results where we apply various compositions of several facial attributes as target descriptions.

3. Continuous Manipulation

Continuous manipulation results. We show that by starting from the latent code of the original image and walking along the predicted residual latent code, we can naturally obtain smooth image manipulations, providing control over the end result. For reference, the original (left) and target (right) descriptions are shown below each row.
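In practice, this amounts to scaling the predicted residual before decoding. A minimal sketch, where generator, w_inv (the inverted latent code of the input image), and delta_w (the predicted residual latent code) are placeholder names:

# Walking along the predicted residual latent code; all names are placeholders.
import torch

def continuous_edit(generator, w_inv, delta_w, steps=5):
    # Move from the inversion (alpha = 0) toward the full edit (alpha = 1).
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        w_edit = w_inv + alpha * delta_w
        frames.append(generator(w_edit))
    return frames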

4. Image-Guided Manipulation

Image-guided manipulation results. Our framework also allows using a reference image, rather than text, as the conditioning input for editing. In the figure, these reference images are shown at the top right. Results on different domains illustrate that our model can transfer the look of the conditioning images to the provided input images.
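Since CLIP embeds text and images into a joint space, conditioning on a reference image amounts to replacing the text embedding with an image embedding. A minimal sketch using the OpenAI clip package; the model variant ("ViT-B/32"), the file name, and the absence of embedding normalization are assumptions:

# Swapping the CLIP text embedding for a CLIP image embedding.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Conditioning on a target description:
text_emb = model.encode_text(clip.tokenize(["she has blond hair"]).to(device))

# Conditioning on a reference image instead (same 512-d embedding space),
# which can be fed to the adapter layers in place of the text embedding.
# "reference.jpg" is a placeholder path.
ref = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)
image_emb = model.encode_image(ref)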

5. Manipulations with Unseen Captions

Additional manipulation results with out-of-distribution target descriptions. We demonstrate that our CLIPInverter method can perform manipulations with target descriptions involving words never seen during training but semantically similar to the observed ones.

6. Comparisons Against Other Approaches

Comparison against the state-of-the-art text-guided manipulation methods. Our method applies the target edits mentioned in the given descriptions much more accurately than the competing approaches, especially when there are multiple attributes present in the descriptions.

7. Ablation Study

Qualitative results for the ablation study. The global CLIP loss leads to unintuitive and unnatural results. Without perceptual losses, unwanted manipulations occur. Without the cycle pass or CLIPRemapper, we are not able to apply all the desired manipulations.
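For reference, the ablated global CLIP loss is commonly defined as one minus the cosine similarity between the CLIP embeddings of the edited image and the target text. A generic sketch of that definition, not necessarily the exact formulation used in the paper:

# Generic global CLIP loss: 1 - cos(CLIP(image), CLIP(text)).
import torch.nn.functional as F

def global_clip_loss(clip_model, edited_images, tokenized_text):
    # edited_images must already be resized/normalized to CLIP's expected input.
    img_emb = clip_model.encode_image(edited_images)
    txt_emb = clip_model.encode_text(tokenized_text)
    return 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()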

BibTeX
@article{CLIPInverter,
  author    = {Baykal, Ahmet Canberk and Anees, Abdul Basit and Ceylan, Duygu and Erdem, Erkut and Erdem, Aykut and Yuret, Deniz},
  title     = {CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing},
  journal   = {ACM Trans. Graph.},
  year      = {2023},
  month     = {jul},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  issn      = {0730-0301},
  doi       = {10.1145/3610287},
  url       = {https://doi.org/10.1145/3610287},
  note      = {Just Accepted},
  keywords  = {Image-to-Image Translation, Generative Adversarial Networks, Image Editing}
}

Contact
For any questions, please contact Ahmet Canberk Baykal at canberk.baykal1@gmail.com.