SalViT360-AV

Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos

Abstract

Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. In this context, ODVs have transformed the user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos.

Model Overview

Overview of the proposed SalViT360-AV pipeline. We use the SalViT360 [1] model as the video saliency module (top), and our implementation allows any audio model to serve as the audio backbone (bottom). The audio stream takes as input spatial audio waveforms \(x_{aud} \in \mathbb{R}^{4 \times N}\) encoded as first-order ambisonics in 4-channel B-format. To simulate what the subjects hear while looking at a particular location, we rotate the ambisonics according to the angular coordinates \((\theta, \phi)\) of each tangent viewport \(\{x_t\}_{t=1}^{T}\) (1). The rotated waveforms are mono, which enables us to use any pre-trained audio backbone for feature extraction (2). The extracted features are passed to the adapter layers in each upgraded transformer block (3) for audio-visual tuning. While the video pipeline has 37M parameters in total, the additional adapter layers require only 600k parameters for fine-tuning.
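To make step (1) concrete, the sketch below shows one way a 4-channel B-format signal could be turned into one mono waveform per tangent viewport, by steering a virtual first-order microphone towards each viewport centre. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name foa_to_mono_viewport, the FuMa-style W, X, Y, Z channel ordering, the cardioid gains, and the example viewport centres are all assumptions.

	import numpy as np

	def foa_to_mono_viewport(x_aud: np.ndarray, theta: float, phi: float) -> np.ndarray:
	    """Decode 4-channel first-order ambisonics (B-format, channels assumed
	    W, X, Y, Z) to a mono waveform steered towards the tangent-viewport
	    centre at azimuth theta and elevation phi (both in radians)."""
	    w, x, y, z = x_aud
	    # Unit vector pointing at the viewport centre on the sphere.
	    dx = np.cos(phi) * np.cos(theta)
	    dy = np.cos(phi) * np.sin(theta)
	    dz = np.sin(phi)
	    # Omni component plus steered figure-of-eight components; the 0.5 gain
	    # gives a cardioid-like pattern under the FuMa convention (the gains
	    # would differ for ACN/SN3D-encoded inputs).
	    return 0.5 * (np.sqrt(2.0) * w + dx * x + dy * y + dz * z)

	# One mono waveform per tangent viewport, e.g. T = 6 hypothetical viewport centres.
	viewport_centres = [(0.0, 0.0), (np.pi / 2, 0.0), (np.pi, 0.0),
	                    (-np.pi / 2, 0.0), (0.0, np.pi / 2), (0.0, -np.pi / 2)]
	x_aud = np.random.randn(4, 48000)  # placeholder: 1 s of B-format audio at 48 kHz
	mono_per_viewport = [foa_to_mono_viewport(x_aud, t, p) for t, p in viewport_centres]

Each resulting mono waveform can then be fed to any pre-trained audio backbone, as described in step (2) above.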


We curated a subset of 360° videos with first-order ambisonics from the YT-360 dataset, containing an equal number of clips from each scene type (indoor, outdoor-natural, and outdoor-manmade) and audio type (human speech, musical instrument, and vehicle sounds). Each category combination has 9 clips, totaling 81 clips.
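For reference, the balanced category grid behind these numbers can be enumerated as follows; the exact category strings and variable names are illustrative and simply follow the description above.

	from itertools import product

	scene_types = ["indoor", "outdoor-natural", "outdoor-manmade"]
	audio_types = ["human speech", "musical instrument", "vehicle sounds"]
	clips_per_combination = 9

	combinations = list(product(scene_types, audio_types))   # 3 x 3 = 9 combinations
	total_clips = len(combinations) * clips_per_combination  # 9 x 9 = 81 clips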




Qualitative comparison of ground-truth saliency maps and SalViT360-AV predictions on the YT360-EyeTracking and 360AV-HM datasets.
Click here for more qualitative results.

Additional Links


SalViT360

SalViT360 Appendix

Code

Supplementary Videos
YT360-EyeTracking Dataset
BibTeX

	@misc{cokelek2023spherical,
		title={Spherical Vision Transformer for 360-degree Video Saliency Prediction}, 
		author={Mert Cokelek and Nevrez Imamoglu and Cagri Ozcinar and Erkut Erdem and Aykut Erdem},
		year={2023},
		eprint={2308.13004},
		archivePrefix={arXiv},
		primaryClass={cs.CV}
	}
Contact

For any questions, please contact Mert Cokelek at mcokelek21@ku.edu.tr or Halit Ozsoy at halitozsoy1584@gmail.com.