SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

Burak Can Biner1,2   Farrin Sofian1,3   Umur Berkay Karakaş1,2   Duygu Ceylan4   Aykut Erdem1,2   Erkut Erdem1,5

KUIS AI Center1   Koç University2   University of California, Irvine3   Adobe Research4   Hacettepe University5

Paper | Code

Abstract

We are witnessing a revolution in conditional image synthesis with the recent success of large-scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketches, and other images has attracted a lot of research, we argue that another equally effective modality is audio, since sound and sight are two main components of human perception. Hence, we propose a method to enable audio conditioning in large-scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross-attention layers, which we fine-tune while freezing the weights of the original layers of the diffusion model. In addition to audio-conditioned image generation, our method can also be used in conjunction with diffusion-based editing methods to enable audio-conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.

1. Model Architecture

In this work, we introduce SonicDiffusion, an approach that steers the process of image generation and editing using auditory inputs. As depicted in the figures below, our proposed approach has two principal components. The first module, termed the Audio Projector, transforms features extracted from an audio clip into a sequence of tokens that live in the inner space of the diffusion model. These tokens are subsequently integrated into the image generation model through newly incorporated audio-image cross-attention layers. Crucially, we maintain the original configuration of the image generation model by freezing its existing layer weights. This positions the added cross-attention layers as adapters, serving as a parameter-efficient way to fuse the audio and visual modalities.
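To make the two components concrete, the sketch below shows one way such an Audio Projector and an audio-image cross-attention adapter could be written in PyTorch. The module names, dimensions, depth, and the zero-initialized gate are illustrative assumptions of this sketch, not the exact released implementation.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps a clip-level audio feature to a sequence of tokens in the same
    space as the diffusion model's text tokens.
    (Illustrative sketch: dimensions and depth are assumptions.)"""
    def __init__(self, audio_dim=1024, token_dim=768, n_tokens=77):
        super().__init__()
        self.n_tokens, self.token_dim = n_tokens, token_dim
        hidden = n_tokens * token_dim // 4
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_tokens * token_dim),
        )

    def forward(self, audio_feat):                 # (B, audio_dim)
        tokens = self.proj(audio_feat)             # (B, n_tokens * token_dim)
        return tokens.view(-1, self.n_tokens, self.token_dim)


class AudioImageCrossAttention(nn.Module):
    """Adapter placed next to the frozen text cross-attention: image features
    attend to the audio tokens. The zero-initialized gate (an assumption of
    this sketch) keeps the layer an identity mapping at initialization."""
    def __init__(self, query_dim=320, token_dim=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=n_heads,
            kdim=token_dim, vdim=token_dim, batch_first=True)
        self.norm = nn.LayerNorm(query_dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, image_feat, audio_tokens):   # (B, HW, C), (B, N, D)
        out, _ = self.attn(self.norm(image_feat), audio_tokens, audio_tokens)
        return image_feat + self.gate.tanh() * out
```

The zero-initialized gate is a common adapter trick: at the start of fine-tuning the new layer acts as an identity, so the behaviour of the pretrained model is preserved and the audio influence is learned gradually.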

Train


Inference
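As the training diagram above indicates, only the newly added modules are optimized. The snippet below is a minimal, assumption-laden sketch of that parameter-efficient setup: freeze every pretrained weight and hand only the Audio Projector and the new cross-attention layers to the optimizer. All names are placeholders, and the optimizer and learning rate are not necessarily the paper's choices.

```python
import itertools
import torch

def make_adapter_optimizer(unet, audio_projector, audio_attn_layers, lr=1e-4):
    """Freeze the pretrained UNet and return an optimizer over the newly
    added audio modules only (placeholder names; illustrative sketch)."""
    for p in unet.parameters():                    # keep pretrained weights fixed
        p.requires_grad_(False)
    trainable = itertools.chain(
        audio_projector.parameters(),
        *(layer.parameters() for layer in audio_attn_layers),
    )
    return torch.optim.AdamW(trainable, lr=lr)
```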


2. Image Generation

SonicDiffusion can be used to generate images from audio clips. We show some examples below; a rough sketch of what a sampling loop for this could look like follows the examples.

Landscape + Into the Wild


  Squishing water  


  Fire crackling  


  Snow  


  Waterfall burbling  


  Forest  


  Wind  

Greatest Hits


  Ceramic  


  Wood  


  Paper  


  Metal  


  Leaf  


  Water  

RAVDESS


  Angry  


  Sad  


  Happy  
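For completeness, here is a rough sketch of an audio-conditioned sampling loop, assuming a diffusers-style scheduler interface (`set_timesteps`, `step(...).prev_sample`). The `unet(latents, t, text_tokens, audio_tokens)` signature and the `decode_latents` helper are hypothetical stand-ins for illustration, not the released API.

```python
import torch

@torch.no_grad()
def generate_from_audio(unet, scheduler, decode_latents,
                        audio_tokens, text_tokens, steps=50, size=64):
    """Hypothetical sampling loop: denoise a random latent while the UNet
    cross-attends to both the (possibly empty) text tokens and the audio
    tokens produced by the Audio Projector."""
    latents = torch.randn(1, 4, size, size, device=audio_tokens.device)
    scheduler.set_timesteps(steps)                 # diffusers-style scheduler
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, text_tokens, audio_tokens)  # assumed signature
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return decode_latents(latents)                 # latent -> RGB image
```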

3. Multi-Modal Image Generation

SonicDiffusion can generate images jointly conditioned on audio clips and text descriptions. We show some examples below, followed by a sketch of one way the two conditions can be combined at sampling time.


 Aurora Borealis Lights


 Sleek Skateboard


 Marble Sculpture


 Lego Style


 Glowing Crystal Ball


 Watercolor Wash
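One common way to balance the two modalities at sampling time is to apply classifier-free guidance to the text and audio conditions separately. The sketch below illustrates this idea; the guidance weights, the `unet(...)` signature, and the null-condition tokens are assumptions for illustration rather than the paper's exact recipe.

```python
import torch

@torch.no_grad()
def guided_noise_pred(unet, latents, t, text_tokens, audio_tokens,
                      null_text, null_audio, w_text=7.5, w_audio=3.0):
    """Classifier-free guidance applied to the text and audio conditions
    separately (weights and the unet(...) signature are assumptions)."""
    uncond = unet(latents, t, null_text, null_audio)        # neither condition
    text_only = unet(latents, t, text_tokens, null_audio)   # text condition only
    full = unet(latents, t, text_tokens, audio_tokens)      # text + audio
    return (uncond
            + w_text * (text_only - uncond)
            + w_audio * (full - text_only))
```

With this formulation, setting `w_audio` to zero recovers plain text-to-image sampling, which gives a simple knob for trading off the influence of the two conditions.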

4. Image Manipulation

SonicDiffusion can also be used to edit images with audio clips. We show some examples below: in each pair, the image on the left is the original and the image on the right is the audio-driven edit. A sketch of how such an edit can be set up follows the examples.

Landscape + Into the Wild

Greatest Hits
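Since SonicDiffusion only adds conditioning layers, it can be paired with standard diffusion-based editing techniques. The sketch below uses an SDEdit-style recipe purely as an assumption (noise the source latent to an intermediate timestep, then denoise with audio conditioning); it is only one plausible pairing, and the helper names as well as the diffusers-style `add_noise`/`step` scheduler interface are placeholders for illustration.

```python
import torch

@torch.no_grad()
def edit_with_audio(unet, scheduler, encode_image, decode_latents,
                    image, text_tokens, audio_tokens,
                    steps=50, strength=0.6):
    """SDEdit-style editing sketch: noise the source latent up to an
    intermediate timestep, then denoise it while cross-attending to the
    audio tokens, so the sound drives the edit while the overall layout of
    the original image is preserved. Helper names are placeholders and
    strength is assumed to be in (0, 1]."""
    scheduler.set_timesteps(steps)
    t_start = steps - int(steps * strength)        # how much of the schedule to skip
    timesteps = scheduler.timesteps[t_start:]

    latents = encode_image(image)                  # source image -> latent
    noise = torch.randn_like(latents)
    latents = scheduler.add_noise(latents, noise, timesteps[:1])

    for t in timesteps:
        noise_pred = unet(latents, t, text_tokens, audio_tokens)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return decode_latents(latents)
```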

5. Sound Interpolation

By interpolating the embeddings of two audio clips, SonicDiffusion can generate a sequence of images that smoothly transitions from one image to another. We show some examples below.
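As a rough sketch of this idea, one can blend the clip-level audio features of the two sounds and render one image per blend. Linear interpolation is used here as an assumption (spherical interpolation is another common choice), and `sample_fn` stands for an audio-conditioned sampler such as the one sketched in Section 2; all names are placeholders.

```python
import torch

def interpolation_frames(feat_a, feat_b, audio_projector, sample_fn, n_frames=8):
    """Generate a sequence of images whose audio conditioning moves smoothly
    from clip A to clip B (n_frames >= 2; all names are placeholders)."""
    frames = []
    for i in range(n_frames):
        alpha = i / (n_frames - 1)
        feat = torch.lerp(feat_a, feat_b, alpha)   # blend the two audio features
        frames.append(sample_fn(audio_projector(feat)))
    return frames
```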



6. Volume Adjustment

SonicDiffusion can generate images that reflect the intensity (volume) of the input audio clip. We show some examples below.
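One simple way to probe this behaviour, shown below as an illustration rather than the released implementation, is to rescale the waveform's gain before it is encoded, so the same clip can be rendered at several loudness levels. All helper names are placeholders.

```python
import torch

def images_at_volumes(waveform, gains, audio_encoder, audio_projector, sample_fn):
    """Render the same clip at several loudness levels by rescaling the
    waveform before encoding (one plausible probe; names are placeholders)."""
    images = []
    for g in gains:                                # e.g. [0.25, 0.5, 1.0, 2.0]
        feat = audio_encoder((waveform * g).clamp(-1.0, 1.0))
        images.append(sample_fn(audio_projector(feat)))
    return images
```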