TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation

¹Koç University, ²Hacettepe University
*Indicates Equal Contribution

Abstract

Recent advances in image generation have led to remarkable improvements in synthesizing perspective images. However, these models still struggle with panoramic image generation due to unique challenges, including varying levels of geometric distortion and the requirement for seamless loop consistency. To address these issues while leveraging the strengths of existing models, we introduce TanDiT, a method that synthesizes panoramic scenes by generating grids of tangent-plane images covering the entire 360° view. Unlike previous methods relying on multiple diffusion branches, TanDiT utilizes a unified diffusion model trained to produce these tangent-plane images simultaneously within a single denoising iteration. Furthermore, we propose a model-agnostic post-processing step specifically designed to enhance global coherence across the generated panoramas. To accurately assess panoramic image quality, we also present two specialized metrics, TangentIS and TangentFID, and provide a comprehensive benchmark comprising captioned panoramic datasets and standardized evaluation scripts. Extensive experiments demonstrate that our method generalizes effectively beyond its training data, robustly interprets detailed and complex text prompts, and seamlessly integrates with various generative models to yield high-quality, diverse panoramic images.

Training Overview

Diagram of the TanDiT training process
Our method starts by decomposing a 360° panoramic image into a set of tangent-plane views via gnomonic projection. These views are arranged into a single grid image in which overlapping regions are placed next to each other to preserve spatial consistency. Given a dense textual caption describing the scene, the model is trained to reconstruct this grid using a standard denoising diffusion objective in the latent space.
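To make the decomposition step concrete, here is a minimal NumPy sketch of sampling tangent-plane views from an equirectangular image and tiling them into one grid. It is an illustration under stated assumptions, not the authors' implementation: the function names `tangent_patch` and `tangent_grid`, the 90° field of view, the patch size, nearest-neighbour sampling, and the "one grid row per latitude" layout are all hypothetical choices made for this example.

```python
import numpy as np

def tangent_patch(equi, lon0_deg, lat0_deg, fov_deg=90.0, size=64):
    """Sample one gnomonic (tangent-plane) view from an equirectangular image.

    equi is an H x W (or H x W x C) array covering 360 x 180 degrees.
    Nearest-neighbour sampling keeps the sketch short.
    """
    h, w = equi.shape[:2]
    lon0, lat0 = np.radians(lon0_deg), np.radians(lat0_deg)

    # Tangent-plane coordinates: the plane sits at unit distance from the
    # sphere centre, so its half-extent is tan(fov / 2).
    half = np.tan(np.radians(fov_deg) / 2.0)
    x, y = np.meshgrid(np.linspace(-half, half, size),
                       np.linspace(half, -half, size))  # top row looks up
    z = np.ones_like(x)

    # Pitch the rays up by lat0 (rotation about the x-axis) ...
    y1 = y * np.cos(lat0) + z * np.sin(lat0)
    z1 = -y * np.sin(lat0) + z * np.cos(lat0)
    # ... then yaw them by lon0 (rotation about the y-axis).
    x2 = x * np.cos(lon0) + z1 * np.sin(lon0)
    z2 = -x * np.sin(lon0) + z1 * np.cos(lon0)

    # Ray direction -> longitude / latitude on the sphere.
    norm = np.sqrt(x2**2 + y1**2 + z2**2)
    lon = np.arctan2(x2, z2)
    lat = np.arcsin(np.clip(y1 / norm, -1.0, 1.0))

    # Longitude / latitude -> pixel indices (longitude wraps, latitude clamps).
    cols = np.floor((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    rows = np.clip(np.floor((0.5 - lat / np.pi) * h).astype(int), 0, h - 1)
    return equi[rows, cols]

def tangent_grid(equi, lats_deg, lons_per_row, fov_deg=90.0, size=64):
    """Tile tangent views into one image (hypothetical layout).

    Each grid row holds the views at one latitude, ordered by longitude,
    so views that overlap horizontally end up side by side.
    """
    lons = np.arange(lons_per_row) * (360.0 / lons_per_row) - 180.0
    grid_rows = []
    for lat in lats_deg:
        patches = [tangent_patch(equi, lon, lat, fov_deg, size) for lon in lons]
        grid_rows.append(np.concatenate(patches, axis=1))
    return np.concatenate(grid_rows, axis=0)
```

For instance, `tangent_grid(equi, [-45.0, 0.0, 45.0], 6, size=8)` would tile 3 × 6 views of 8 × 8 pixels each into a 24 × 48 grid. A practical pipeline would likely use bilinear interpolation and a view layout chosen to cover the poles as well.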

Inference Overview

Diagram of the TanDiT inference process
At inference time, TanDiT first generates a grid of tangent views conditioned on a text prompt. These tangent views are enhanced using a super-resolution module and then reprojected to form an intermediate equirectangular panorama. To further improve global coherence and visual quality, the latent representation of this panorama is perturbed with noise and refined by a pre-trained DiT model, conditioned on the same text input, producing the final high-resolution 360° image.
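The final refinement step (perturb the panorama latent with noise, then denoise it) can be sketched as follows. This is only an illustration of the general noise-then-denoise idea: the variance-preserving blend, the `strength` parameter, the linear schedule, and the `denoise_step` callable (a stand-in for one step of a pre-trained, text-conditioned model) are assumptions introduced here, and the model actually used may follow a different noising formulation.

```python
import numpy as np

def perturb_latent(z0, strength, rng):
    """Blend a clean latent with Gaussian noise (variance-preserving form).

    strength = 0 returns the latent unchanged; strength = 1 returns pure noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    noise = rng.standard_normal(z0.shape)
    return np.sqrt(1.0 - strength) * z0 + np.sqrt(strength) * noise

def refine(z0, denoise_step, strength=0.3, num_steps=10, seed=0):
    """Partially noise a latent, then walk the noise level back to zero.

    denoise_step(z, current_level, next_level) is a hypothetical placeholder
    for one update of a pre-trained denoiser conditioned on the text prompt.
    """
    rng = np.random.default_rng(seed)
    z = perturb_latent(z0, strength, rng)
    levels = np.linspace(strength, 0.0, num_steps + 1)
    for cur, nxt in zip(levels[:-1], levels[1:]):
        z = denoise_step(z, cur, nxt)
    return z
```

A small `strength` keeps the layout of the intermediate panorama and lets the denoiser repair only fine detail and seams, while a large one gives the model more freedom at the cost of fidelity to the reprojected input.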

In-Domain Generated 360° Panoramas

Generation Prompt:
The image depicts a nighttime scene of a European town square. The square is paved with dark, glossy tiles that reflect the lights from the surrounding buildings and street lamps. The buildings are multi-storied and painted in various pastel colors, including shades of pink, yellow, and green. The architecture is traditional, with shuttered windows and balconies adorned with plants and flowers. In the center of the square, there is a fountain with water spouting from its top, surrounded by benches for people to sit and enjoy the view. The square is illuminated by several street lamps that cast a warm glow on the surroundings. There are also some bicycles parked along the edges of the square. The sky above is dark, indicating that it is nighttime. The overall atmosphere of the scene is peaceful and serene, with no people visible in the image.


Baseline Comparisons

Generation Prompt:
The image depicts a nighttime scene of a European town square. The square is paved with dark, glossy tiles that reflect the lights from the surrounding buildings and street lamps. The buildings are multi-storied and painted in various pastel colors, including shades of pink, yellow, and green. The architecture is traditional, with shuttered windows and balconies adorned with plants and flowers. In the center of the square, there is a fountain with water spouting from its top, surrounded by benches for people to sit and enjoy the view. The square is illuminated by several street lamps that cast a warm glow on the surroundings. There are also some bicycles parked along the edges of the square. The sky above is dark, indicating that it is nighttime. The overall atmosphere of the scene is peaceful and serene, with no people visible in the image.


Out-of-Domain Generated 360° Panoramas

Generation Prompt:
vibrant fireworks burst across the night sky, painting the heavens with shimmering trails of vivid colors. some fireworks transform into heart shapes before fading, adding a touch of elegance to the display. the camera focuses on explosive arcs and sparkling embers, capturing every brilliant flash against an infinite, celestial canvas. the city's skyline stretches below, clearly visible as vibrant fireworks light up the night sky. the fireworks burst in various colors, scattering across the air, while their reflections shimmer on the glass windows of skyscrapers. the camera smoothly pans across the city, capturing the river and bridges, with distant car lights creating a flowing effect. a breathtaking city skyline stretches below, illuminated by countless lights reflecting off towering skyscrapers. the camera smoothly pans across the landscape, revealing a river winding through the metropolis and bridges glowing under streetlights. distant car headlights flow like streams of light, adding a dynamic rhythm to the urban nightscape.


Stylized Generated 360° Panoramas

Generation Prompt:
an upward view of the night sky. soft moonlight filters through wispy clouds, casting a serene glow over the winter landscape. from the snowy fields, a view toward a peaceful village nestled among snow-covered hills. warm lights glow from the windows of small wooden cabins, contrasting with the crisp, cold air under the moonlit sky. a high-angle view of snow-covered paths winding through the landscape. the fresh snow glistens under the moonlight, while the warm glow of lanterns and fireplaces reflects off the frosty roads, creating a cozy contrast against the cold night.
Style:
Anime


BibTeX

@article{capuk2025tandit,
      title={TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360{\textdegree} Panorama Generation},
      author={Hakan Çapuk and Andrew Bond and Muhammed Burak Kızıl and Emir Göçen and Erkut Erdem and Aykut Erdem},
      year={2025},
      eprint={2506.21681},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21681}, 
}