Model Comparison
Hands squeeze an empty plastic water bottle, place the opening over an egg yolk in a bowl, and release the squeeze, sucking the yolk into the bottle, leaving the white behind.
Tested concepts: fundamental physics, object properties & affordances, spatial reasoning, action & procedural understanding, force and motion.
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions: domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model with answering these physics questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
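To make the benchmark format concrete, the following is a hypothetical sketch of what a single PhysVidBench item might look like: a physically grounded prompt paired with yes/no questions, each tagged with the dimensions it probes. The field names and question wording are illustrative assumptions, not the benchmark's actual schema; the scene is the bottle-and-yolk example shown above.

```python
# Illustrative (hypothetical) structure of one benchmark item.
example_item = {
    "prompt": (
        "A hand squeezes an empty plastic bottle, places its opening over an egg "
        "yolk in a bowl, and releases the squeeze, drawing the yolk into the bottle."
    ),
    "questions": [
        {
            "text": "Is the yolk pulled into the bottle when the squeeze is released?",
            "dimensions": ["Fundamental Physics", "Force and Motion"],
        },
        {
            "text": "Does the egg white remain in the bowl after the yolk is removed?",
            "dimensions": ["Object Properties & Affordances",
                           "Material Interaction & Transformation"],
        },
    ],
}
# Each question expects a yes/no answer, later derived from video captions
# rather than from the video itself.
```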
Physical commonsense dimensions tested in PhysVidBench, illustrated through video examples generated by various state-of-the-art text-to-video (T2V) models. Prompts are designed to probe specific categories such as force and motion, object affordance, spatial containment, temporal progression, and material interaction.
Video Generation Pipeline. An overview of the four-stage pipeline used to construct PhysVidBench. We begin with correct solutions from the PIQA dataset and filter them using a large language model (LLM) to identify instances of secondary tool use and object affordances (Stage 1). These filtered solutions are converted into short, physically grounded video generation prompts via another LLM pass (Stage 2). Each base prompt is further upsampled to enhance its physical specificity and causal structure while preserving the original scene (Stage 3). Both base and upsampled prompts are fed into a diverse set of state-of-the-art text-to-video generation models, resulting in videos that probe tool interaction and physical commonsense (Stage 4).
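A minimal sketch of this four-stage construction, assuming generic LLM and text-to-video backends. The helpers `llm` and `generate_video` are placeholders introduced here for illustration; they are not part of PhysVidBench's released code.

```python
from dataclasses import dataclass

@dataclass
class PromptPair:
    base: str        # short, physically grounded prompt (Stage 2)
    upsampled: str   # enriched with physical and causal detail (Stage 3)

def llm(instruction: str, text: str) -> str:
    """Placeholder for a large language model call; a real pipeline would query an LLM."""
    return "yes" if "yes/no" in instruction else text  # stub behavior

def generate_video(model_name: str, prompt: str) -> str:
    """Placeholder for a text-to-video model call; returns a path to the rendered video."""
    return f"videos/{model_name}/{abs(hash(prompt)) % 10000}.mp4"  # stub

def build_benchmark(piqa_solutions, t2v_models):
    videos = []
    for solution in piqa_solutions:
        # Stage 1: keep only PIQA solutions involving secondary tool use / affordances.
        if llm("Does this solution involve secondary tool use? Answer yes/no.", solution) != "yes":
            continue
        # Stage 2: rewrite the solution as a short, physically grounded video prompt.
        base = llm("Rewrite as a short, physically grounded video prompt.", solution)
        # Stage 3: upsample the prompt with extra physical specificity, preserving the scene.
        upsampled = llm("Add physical and causal detail while preserving the scene.", base)
        pair = PromptPair(base, upsampled)
        # Stage 4: render both prompt variants with every T2V model under evaluation.
        for model in t2v_models:
            for prompt in (pair.base, pair.upsampled):
                videos.append((model, prompt, generate_video(model, prompt)))
    return videos
```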
Evaluation Pipeline. Overview of our three-stage evaluation framework designed to assess physical commonsense in generated videos. Using the upsampled prompts, we generate yes/no questions aligned with one or more of the seven physical commonsense dimensions in our ontology (Stage 1). Each generated video is captioned using AuroraCap, which produces a general-purpose caption and seven dimension-specific captions aimed at surfacing evidence relevant to different types of physical reasoning (Stage 2). An LLM is then prompted to answer the targeted physics questions using only these video captions (Stage 3).
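A rough sketch of Stages 2 and 3 of this evaluation, under the assumption that the judge LLM sees captions only and never the raw video. `caption_video` stands in for AuroraCap and `answer_from_captions` for the judge model; both are placeholder stubs, and the question records are assumed to come from Stage 1.

```python
DIMENSIONS = [
    "Action & Procedural Understanding", "Force and Motion", "Fundamental Physics",
    "Material Interaction & Transformation", "Object Properties & Affordances",
    "Spatial Reasoning", "Temporal Dynamics",
]

def caption_video(video_path: str, focus: str = "") -> str:
    """Placeholder for AuroraCap: a general caption, or one focused on a single dimension."""
    return f"[caption of {video_path}{', focusing on ' + focus if focus else ''}]"  # stub

def answer_from_captions(question: str, captions: list) -> str:
    """Placeholder for the judge LLM, which reads only captions, never pixels."""
    return "no"  # stub

def evaluate_video(video_path: str, questions: list) -> dict:
    # Stage 2: one general-purpose caption plus seven dimension-specific captions.
    captions = [caption_video(video_path)] + [caption_video(video_path, d) for d in DIMENSIONS]
    # Stage 3: answer each grounded yes/no question (from Stage 1) using only the captions.
    return {q["text"]: answer_from_captions(q["text"], captions) for q in questions}
```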
Model | AU | FM | FP | MT | OP | SR | TD | Avg
---|---|---|---|---|---|---|---|---
LTX-Video | 20.6 (-4.2) | 19.7 (-3.4) | 18.0 (-2.1) | 16.4 (-1.5) | 24.3 (-4.4) | 17.1 (-2.4) | 14.1 (-4.8) | 20.7 (-2.8) |
VideoCrafter2 | 17.7 (-1.6) | 19.3 (-2.8) | 22.6 (-2.8) | 18.3 (-2.8) | 26.4 (-2.6) | 15.7 (-0.2) | 12.3 (-1.9) | 22.0 (-1.8) |
CogVideoX (2B) | 22.2 (-1.5) | 24.4 (-4.3) | 23.3 (-2.8) | 22.5 (-0.4) | 29.2 (-4.0) | 19.5 (-1.3) | 16.7 (-1.5) | 25.6 (-3.0) |
CogVideoX (5B) | 18.7 (-0.8) | 19.4 (-3.1) | 20.7 (-3.2) | 17.8 (-1.5) | 25.6 (-1.8) | 17.9 (-0.4) | 12.6 (+0.8) | 20.8 (-1.5) |
Wan2.1 (1.3B) | 30.0 (-3.6) | 28.9 (-5.3) | 28.2 (-3.3) | 28.5 (-2.0) | 35.1 (-3.9) | 27.3 (-3.3) | 19.0 (-0.0) | 30.5 (-3.2) |
Wan2.1 (14B) | 33.3 (-3.4) | 31.4 (-1.7) | 31.9 (-1.4) | 32.4 (-3.3) | 39.1 (-4.5) | 29.8 (-1.5) | 23.4 (-1.5) | 33.9 (-1.9) |
MAGI-1 | 27.2 (-6.0) | 26.3 (-3.8) | 28.8 (-1.6) | 27.7 (-4.6) | 37.3 (-5.7) | 30.3 (-5.3) | 19.3 (-8.1) | 32.5 (-5.5) |
Hunyuan Video | 26.4 (-0.6) | 25.8 (+0.5) | 27.7 (-0.1) | 31.9 (-4.8) | 35.7 (-1.9) | 23.9 (+2.2) | 21.6 (-5.2) | 30.1 (-1.2) |
Cosmos (7B) | 32.1 (-8.1) | 32.4 (-9.1) | 32.9 (-8.6) | 35.8 (-11.0) | 39.6 (-10.5) | 31.7 (-10.1) | 21.6 (-8.2) | 34.8 (-8.3) |
Cosmos (14B) | 35.7 (-6.0) | 33.3 (-6.1) | 30.9 (-4.0) | 36.8 (-6.0) | 41.0 (-9.8) | 32.7 (-5.6) | 20.8 (-0.3) | 36.1 (-7.1) |
Sora | 28.6 (+2.4) | 30.0 (-1.1) | 29.9 (+0.2) | 33.1 (+0.3) | 37.0 (-1.1) | 24.8 (+1.7) | 16.7 (+2.6) | 31.4 (+0.3) |
Veo-2 | 34.9 (-0.7) | 33.8 (-2.3) | 31.7 (-2.7) | 36.0 (-3.1) | 38.4 (-1.5) | 29.7 (+1.0) | 19.5 (+2.5) | 34.8 (-0.7) |
AU: Action & Procedural Understanding, FM: Force and Motion, FP: Fundamental Physics, MT: Material Interaction & Transformation, OP: Object Properties & Affordances, SR: Spatial Reasoning, TD: Temporal Dynamics.
We present evaluation results for 12 state-of-the-art video generation models using PhysVidBench, which measures physical commonsense understanding across seven fine-grained dimensions (Action Understanding, Force & Motion, Fundamental Physics, Material Interaction, Object Properties, Spatial Reasoning, and Temporal Dynamics), along with an overall average. Each model is tested under two prompt variants: base prompts derived directly from PIQA answer texts, and upsampled prompts that enrich physical details and causal structure. Across all models, upsampled prompts consistently yield higher scores, particularly in dimensions that benefit from explicit physical context, such as Object Properties, Force & Motion, and Material Interaction. This suggests that prompt engineering remains a critical factor in eliciting stronger physical reasoning from generative models. Despite these gains, most models still underperform in Spatial Reasoning and Temporal Dynamics, reflecting limitations in current architectures' ability to encode geometry, continuity, and causal flow over time. Interestingly, while larger models such as Wan2.1 14B and Cosmos 14B generally achieve higher scores, size alone does not guarantee robust physical understanding, highlighting the need for improved objective functions, data, or inductive biases tailored to physical commonsense. Overall, PhysVidBench provides a clear and interpretable framework for diagnosing such gaps and benchmarking progress in physically grounded video generation.
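One way such per-dimension scores could be aggregated is sketched below, under the assumption that a dimension's score is the percentage of its yes/no questions answered "yes", computed separately for base and upsampled prompts. This is an illustrative assumption about the aggregation, not the benchmark's official metric definition; the record format is hypothetical.

```python
from collections import defaultdict

def aggregate_scores(records):
    """records: dicts like {"variant": "base"|"upsampled", "dimensions": [...], "answer": "yes"|"no"}."""
    counts = defaultdict(lambda: [0, 0])  # (variant, dimension) -> [yes_count, total_questions]
    for rec in records:
        for dim in rec["dimensions"]:
            key = (rec["variant"], dim)
            counts[key][0] += int(rec["answer"] == "yes")
            counts[key][1] += 1
    # Assumed score per dimension: percentage of questions answered "yes".
    return {key: 100.0 * yes / total for key, (yes, total) in counts.items()}

# Usage sketch: compare prompt variants on one dimension, e.g.
#   scores = aggregate_scores(records)
#   scores[("upsampled", "Spatial Reasoning")] vs. scores[("base", "Spatial Reasoning")]
```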