SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

Burak Can Biner1,2   Farrin Sofian1,3   Umur Berkay Karakaş1,2   Duygu Ceylan4   Aykut Erdem1,2   Erkut Erdem1,5

KUIS AI Center1   Koç University2   University of California, Irvine3   Adobe Research4   Hacettepe University5

Paper | Code

Abstract

We are witnessing a revolution in conditional image synthesis with the recent success of large-scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketches, and other images has attracted a lot of research, we argue that another equally effective modality is audio, since sound and sight are two main components of human perception. Hence, we propose a method to enable audio conditioning in large-scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross-attention layers, which we fine-tune while freezing the weights of the original layers of the diffusion model. In addition to audio-conditioned image generation, our method can also be used in conjunction with diffusion-based editing methods to enable audio-conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.

1. Model Architecture

In this work, we introduce SonicDiffusion, an approach that steers the process of image generation and editing using auditory inputs. As depicted in the figures below, our proposed approach has two principal components. The first module, termed the Audio Projector, transforms features extracted from an audio clip into a sequence of tokens that live in the inner space of the diffusion model. These tokens are subsequently integrated into the image generation model through newly incorporated audio-image cross-attention layers. Crucially, we maintain the original configuration of the image generation model by freezing its existing layer weights. This positions the added cross-attention layers as adapters, serving as a parameter-efficient way to fuse the audio and visual modalities.
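To make the two components concrete, the sketch below shows one way such an Audio Projector and an audio-image cross-attention adapter could be written in PyTorch. The module names, dimensions, depth, and the zero-initialized gate are illustrative assumptions of this sketch, not the exact released implementation.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps a clip-level audio feature to a sequence of tokens in the same
    space as the diffusion model's text tokens.
    (Illustrative sketch: dimensions and depth are assumptions.)"""
    def __init__(self, audio_dim=1024, token_dim=768, n_tokens=77):
        super().__init__()
        self.n_tokens, self.token_dim = n_tokens, token_dim
        hidden = n_tokens * token_dim // 4
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_tokens * token_dim),
        )

    def forward(self, audio_feat):                 # (B, audio_dim)
        tokens = self.proj(audio_feat)             # (B, n_tokens * token_dim)
        return tokens.view(-1, self.n_tokens, self.token_dim)


class AudioImageCrossAttention(nn.Module):
    """Adapter placed next to the frozen text cross-attention: image features
    attend to the audio tokens. The zero-initialized gate (an assumption of
    this sketch) keeps the layer an identity mapping at initialization."""
    def __init__(self, query_dim=320, token_dim=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=n_heads,
            kdim=token_dim, vdim=token_dim, batch_first=True)
        self.norm = nn.LayerNorm(query_dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, image_feat, audio_tokens):   # (B, HW, C), (B, N, D)
        out, _ = self.attn(self.norm(image_feat), audio_tokens, audio_tokens)
        return image_feat + self.gate.tanh() * out
```

The zero-initialized gate is a common adapter trick: at the start of fine-tuning the new layer acts as an identity, so the behaviour of the pretrained model is preserved and the audio influence is learned gradually.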

Train


Inference
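As the training diagram above indicates, only the newly added modules are optimized. The snippet below is a minimal, assumption-laden sketch of that parameter-efficient setup: freeze every pretrained weight and hand only the Audio Projector and the new cross-attention layers to the optimizer. All names are placeholders, and the optimizer and learning rate are not necessarily the paper's choices.

```python
import itertools
import torch

def make_adapter_optimizer(unet, audio_projector, audio_attn_layers, lr=1e-4):
    """Freeze the pretrained UNet and return an optimizer over the newly
    added audio modules only (placeholder names; illustrative sketch)."""
    for p in unet.parameters():                    # keep pretrained weights fixed
        p.requires_grad_(False)
    trainable = itertools.chain(
        audio_projector.parameters(),
        *(layer.parameters() for layer in audio_attn_layers),
    )
    return torch.optim.AdamW(trainable, lr=lr)
```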


2. Image Generation

SonicDiffusion can be used to generate images from audio clips. We show some examples below; a rough sketch of what a sampling loop for this could look like follows the examples.

Landscape + Into the Wild


  Squishing water  


  Fire crackling  


  Snow  


  Waterfall burbling  


  Forest  


  Wind  

Greatest Hits


  Ceramic  


  Wood  


  Paper  


  Metal  


  Leaf  


  Water  

RAVDESS


  Angry  


  Sad  


  Happy  
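For completeness, here is a rough sketch of an audio-conditioned sampling loop, assuming a diffusers-style scheduler interface (`set_timesteps`, `step(...).prev_sample`). The `unet(latents, t, text_tokens, audio_tokens)` signature and the `decode_latents` helper are hypothetical stand-ins for illustration, not the released API.

```python
import torch

@torch.no_grad()
def generate_from_audio(unet, scheduler, decode_latents,
                        audio_tokens, text_tokens, steps=50, size=64):
    """Hypothetical sampling loop: denoise a random latent while the UNet
    cross-attends to both the (possibly empty) text tokens and the audio
    tokens produced by the Audio Projector."""
    latents = torch.randn(1, 4, size, size, device=audio_tokens.device)
    scheduler.set_timesteps(steps)                 # diffusers-style scheduler
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, text_tokens, audio_tokens)  # assumed signature
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return decode_latents(latents)                 # latent -> RGB image
```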

3. Multi-Modal Image Generation

SonicDiffusion can generate images jointly conditioned on audio clips and text descriptions. We show some examples below, followed by a sketch of one way the two conditions can be combined at sampling time.


 Aurora Borealis Lights


 Sleek Skateboard


 Marble Sculpture


 Lego Style


 Glowing Crystal Ball


 Watercolor Wash
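One common way to balance the two modalities at sampling time is to apply classifier-free guidance to the text and audio conditions separately. The sketch below illustrates this idea; the guidance weights, the `unet(...)` signature, and the null-condition tokens are assumptions for illustration rather than the paper's exact recipe.

```python
import torch

@torch.no_grad()
def guided_noise_pred(unet, latents, t, text_tokens, audio_tokens,
                      null_text, null_audio, w_text=7.5, w_audio=3.0):
    """Classifier-free guidance applied to the text and audio conditions
    separately (weights and the unet(...) signature are assumptions)."""
    uncond = unet(latents, t, null_text, null_audio)        # neither condition
    text_only = unet(latents, t, text_tokens, null_audio)   # text condition only
    full = unet(latents, t, text_tokens, audio_tokens)      # text + audio
    return (uncond
            + w_text * (text_only - uncond)
            + w_audio * (full - text_only))
```

With this formulation, setting `w_audio` to zero recovers plain text-to-image sampling, which gives a simple knob for trading off the influence of the two conditions.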

4. Image Manipulation

SonicDiffusion can also be used to edit images with audio clips. We show some examples below: in each pair, the image on the left is the original and the image on the right is the audio-driven edit. A sketch of how such an edit can be set up follows the examples.

Landscape + Into the Wild

Greatest Hits
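Since SonicDiffusion only adds conditioning layers, it can be paired with standard diffusion-based editing techniques. The sketch below uses an SDEdit-style recipe purely as an assumption (noise the source latent to an intermediate timestep, then denoise with audio conditioning); it is only one plausible pairing, and the helper names as well as the diffusers-style `add_noise`/`step` scheduler interface are placeholders for illustration.

```python
import torch

@torch.no_grad()
def edit_with_audio(unet, scheduler, encode_image, decode_latents,
                    image, text_tokens, audio_tokens,
                    steps=50, strength=0.6):
    """SDEdit-style editing sketch: noise the source latent up to an
    intermediate timestep, then denoise it while cross-attending to the
    audio tokens, so the sound drives the edit while the overall layout of
    the original image is preserved. Helper names are placeholders and
    strength is assumed to be in (0, 1]."""
    scheduler.set_timesteps(steps)
    t_start = steps - int(steps * strength)        # how much of the schedule to skip
    timesteps = scheduler.timesteps[t_start:]

    latents = encode_image(image)                  # source image -> latent
    noise = torch.randn_like(latents)
    latents = scheduler.add_noise(latents, noise, timesteps[:1])

    for t in timesteps:
        noise_pred = unet(latents, t, text_tokens, audio_tokens)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return decode_latents(latents)
```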

5. Sound Interpolation

By interpolating the embeddings of two audio clips, SonicDiffusion can generate a sequence of images that smoothly transitions from one image to another. We show some examples below.
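As a rough sketch of this idea, one can blend the clip-level audio features of the two sounds and render one image per blend. Linear interpolation is used here as an assumption (spherical interpolation is another common choice), and `sample_fn` stands for an audio-conditioned sampler such as the one sketched in Section 2; all names are placeholders.

```python
import torch

def interpolation_frames(feat_a, feat_b, audio_projector, sample_fn, n_frames=8):
    """Generate a sequence of images whose audio conditioning moves smoothly
    from clip A to clip B (n_frames >= 2; all names are placeholders)."""
    frames = []
    for i in range(n_frames):
        alpha = i / (n_frames - 1)
        feat = torch.lerp(feat_a, feat_b, alpha)   # blend the two audio features
        frames.append(sample_fn(audio_projector(feat)))
    return frames
```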



6. Volume Adjustment

SonicDiffusion can generate images that reflect the intensity (volume) of the input audio clip. We show some examples below.
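One simple way to probe this behaviour, shown below as an illustration rather than the released implementation, is to rescale the waveform's gain before it is encoded, so the same clip can be rendered at several loudness levels. All helper names are placeholders.

```python
import torch

def images_at_volumes(waveform, gains, audio_encoder, audio_projector, sample_fn):
    """Render the same clip at several loudness levels by rescaling the
    waveform before encoding (one plausible probe; names are placeholders)."""
    images = []
    for g in gains:                                # e.g. [0.25, 0.5, 1.0, 2.0]
        feat = audio_encoder((waveform * g).clamp(-1.0, 1.0))
        images.append(sample_fn(audio_projector(feat)))
    return images
```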