
HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation


Abstract

Generative Adversarial Networks (GANs), particularly StyleGAN and its variants, have demonstrated remarkable capabilities in generating highly realistic images. Despite their success, adapting these models to diverse tasks such as domain adaptation, reference-guided synthesis, and text-guided manipulation with limited training data remains challenging. To this end, we present a novel framework that significantly extends the capabilities of a pre-trained StyleGAN by integrating CLIP space via hypernetworks. This integration enables dynamic adaptation of StyleGAN to new domains defined by reference images or textual descriptions. Additionally, we introduce a CLIP-guided discriminator that enhances the alignment between generated images and target domains, ensuring superior image quality. Our approach demonstrates unprecedented flexibility, enabling text-guided image manipulation without the need for text-specific training data and facilitating seamless style transfer. Comprehensive qualitative and quantitative evaluations confirm the robustness and superior performance of our framework compared to existing methods.


Method Overview


Overview of HyperGAN-CLIP. The framework employs hypernetwork modules to adjust the weights of a pre-trained StyleGAN generator based on CLIP embeddings of reference images or text prompts. These conditioning inputs drive domain adaptation, attribute transfer, and image editing. The modulated weights are blended with the original features to produce images that align with the specified domain or task, such as reference-guided synthesis and text-guided manipulation, while maintaining the integrity of the source image.
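
To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of the idea described above. The module and function names (HyperModulator, blend_features) are hypothetical illustrations, not the authors' implementation, and the low-rank parameterization is an assumption made for brevity.

import torch
import torch.nn as nn

class HyperModulator(nn.Module):
    # Hypothetical hypernetwork module: maps a CLIP embedding (image or text)
    # to a low-rank multiplicative offset for one frozen StyleGAN conv weight.
    def __init__(self, clip_dim: int, out_ch: int, in_ch: int, rank: int = 4):
        super().__init__()
        self.rank = rank
        self.to_a = nn.Linear(clip_dim, out_ch * rank)
        self.to_b = nn.Linear(clip_dim, in_ch * rank)

    def forward(self, weight: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        # weight: frozen conv weight (out_ch, in_ch, k, k); clip_emb: (clip_dim,)
        out_ch, in_ch, _, _ = weight.shape
        a = self.to_a(clip_emb).view(out_ch, self.rank)
        b = self.to_b(clip_emb).view(self.rank, in_ch)
        delta = (a @ b).view(out_ch, in_ch, 1, 1)
        return weight * (1.0 + delta)  # modulated weights for this layer

def blend_features(original: torch.Tensor, modulated: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # Residual mixing: keep source content while injecting target-domain cues.
    return (1.0 - alpha) * original + alpha * modulated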


HyperGAN-CLIP results:

  1. HyperGAN-CLIP Applications
  2. Qualitative Comparisons - Domain Adaptation
  3. Domain Mixing
  4. Semantic Editing in Target Domains
  5. Qualitative Comparisons - Reference-Guided Image Synthesis
  6. Reference-Guided Synthesis with Mixed Embeddings
  7. Reference-Guided Synthesis on Real Images
  8. Qualitative Comparisons - Text-Guided Image Manipulation
  9. Text-Guided Image Manipulation on Real Images

1. HyperGAN-CLIP Applications

HyperGAN-CLIP and its Applications. HyperGAN-CLIP is a flexible framework that extends the capabilities of a pre-trained StyleGAN model to a multitude of tasks, including one-shot adaptation to multiple domains, reference-guided image synthesis, and text-guided image manipulation. Our method pushes the boundaries of image synthesis and editing, enabling users to create diverse, high-quality images with remarkable ease and precision.

2. Qualitative Comparisons - Domain Adaptation

Comparison against state-of-the-art few-shot domain adaptation methods. Our proposed HyperGAN-CLIP model outperforms competing methods in accurately capturing the visual characteristics of the target domains.

3. Domain Mixing

Domain mixing. Our approach can fuse multiple domains to create novel compositions. By averaging and re-scaling the CLIP embeddings of two target domains, we can generate images that blend characteristics from both.
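As a rough illustration, the mixing step can be sketched as follows; the helper name and the unit-norm re-scaling are assumptions made for clarity, and the exact re-scaling used in the paper may differ.

import torch

def mix_domain_embeddings(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    # emb_a, emb_b: L2-normalized CLIP embeddings of two target domains.
    mixed = 0.5 * (emb_a + emb_b)   # average the two domain embeddings
    return mixed / mixed.norm()     # re-scale back to unit norm

# The mixed embedding conditions the hypernetwork just like a single-domain one.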

4. Semantic Editing in Target Domains

Semantic editing in target domains. Since the latent mapper is kept intact, our approach allows existing latent space discovery methods to be used for semantic edits. We manipulate two sample face images from adapted domains by adjusting age, smile, and pose with InterFaceGAN.
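
For illustration, an edit of this kind reduces to an offset along a precomputed latent direction. The function below is a hypothetical sketch, assuming W+ codes and unit-norm InterFaceGAN directions, not the authors' code.

import torch

def apply_semantic_edit(w_plus: torch.Tensor, direction: torch.Tensor, strength: float) -> torch.Tensor:
    # w_plus: (num_layers, 512) latent code; direction: (512,) edit direction
    # (e.g. age, smile, or pose) found with InterFaceGAN.
    return w_plus + strength * direction

# The edited code is then fed to the domain-adapted generator as usual.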

5. Qualitative Comparisons - Reference-Guided Image Synthesis

Comparison with state-of-the-art reference-guided image synthesis approaches. Our approach transfers the style of the target image to the source image while preserving identity more effectively than competing methods.

6. Reference-Guided Synthesis with Mixed Embeddings

Reference-guided image synthesis with mixed embeddings. Each row shows the input image, the initial result with the CLIP image embedding, the refined result with a mixed embedding that incorporates the target attribute with α=0.5, and the reference image, respectively. Target text attributes are beard (top row), black hair (middle row), and smiling (bottom row). Incorporating mixed modality embeddings results in more accurate and detailed image modifications.
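
A simple way to form such a mixed embedding with α=0.5 is sketched below using the open-source CLIP package; the helper name and the final re-normalization are assumptions, not the exact procedure from the paper.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def mixed_embedding(reference_path: str, attribute_text: str, alpha: float = 0.5) -> torch.Tensor:
    # Encode the reference image and the target attribute text with CLIP.
    image = preprocess(Image.open(reference_path)).unsqueeze(0).to(device)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(clip.tokenize([attribute_text]).to(device))
    # Normalize each embedding, then blend and re-normalize.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    mixed = (1.0 - alpha) * img_emb + alpha * txt_emb
    return mixed / mixed.norm(dim=-1, keepdim=True)

# e.g. mixed_embedding("reference.jpg", "black hair", alpha=0.5)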

7. Reference-Guided Synthesis on Real Images

Reference-guided image synthesis on real images. Our model can effectively transfer the style of the target image to the source image while preserving the identity of the source image. The results demonstrate the robustness of our model in handling real images.

8. Qualitative Comparisons - Text-Guided Image Manipulation

Comparisons with state-of-the-art text-guided image manipulation methods. Our model shows remarkable versatility in manipulating images across a diverse range of textual descriptions. The results vividly illustrate our model's ability to accurately apply changes based on target descriptions encompassing both single and multiple attributes. Compared to the competing approaches, our model preserves the identity of the input much better while successfully executing the desired manipulations.

9. Text-Guided Image Manipulation on Real Images

Text-guided image manipulation on real images. Our model can effectively manipulate real images based on textual descriptions. The results demonstrate the robustness of our model in handling real images and executing the desired manipulations.

BibTeX
@inproceedings{Anees2024HyperGANCLIP,
    title     = {HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation},
    author    = {Abdul Basit Anees and Ahmet Canberk Baykal and Duygu Ceylan and Aykut Erdem and Erkut Erdem and Muhammed Burak Kızıl},
    booktitle = {Proceedings of the ACM (SIGGRAPH Asia)},
    year      = {2024}
}

Contact
For any questions, please contact Abdul Basit Anees at abdulbasitanees98@gmail.com.