Model Comparison
Hands squeeze an empty plastic water bottle, place the opening over an egg yolk in a bowl, and release the squeeze, sucking the yolk into the bottle, leaving the white behind.
Tested concepts: fundamental physics, object properties & affordances, spatial reasoning, action & procedural understanding, force and motion.
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts emphasizing tool use, material properties, and procedural interactions: domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model with answering these physics questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
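To make the benchmark format concrete, the following is a hypothetical sketch of what a single PhysVidBench item might look like: a physically grounded prompt paired with yes/no questions, each tagged with the dimensions it probes. The field names and question wording are illustrative assumptions, not the benchmark's actual schema; the scene is the bottle-and-yolk example shown above.

```python
# Illustrative (hypothetical) structure of one benchmark item.
example_item = {
    "prompt": (
        "A hand squeezes an empty plastic bottle, places its opening over an egg "
        "yolk in a bowl, and releases the squeeze, drawing the yolk into the bottle."
    ),
    "questions": [
        {
            "text": "Is the yolk pulled into the bottle when the squeeze is released?",
            "dimensions": ["Fundamental Physics", "Force and Motion"],
        },
        {
            "text": "Does the egg white remain in the bowl after the yolk is removed?",
            "dimensions": ["Object Properties & Affordances",
                           "Material Interaction & Transformation"],
        },
    ],
}
# Each question expects a yes/no answer, later derived from video captions
# rather than from the video itself.
```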
Physical commonsense dimensions tested in PhysVidBench, illustrated through video examples generated by various state-of-the-art text-to-video (T2V) models. Prompts are designed to probe specific categories such as force and motion, object affordance, spatial containment, temporal progression, and material interaction.
Video Generation Pipeline. An overview of the four-stage pipeline used to construct PhysVidBench. We begin with correct solutions from the PIQA dataset and filter them using a large language model (LLM) to identify instances of secondary tool use and object affordances (Stage 1). These filtered solutions are converted into short, physically grounded video generation prompts via another LLM pass (Stage 2). Each base prompt is further upsampled to enhance its physical specificity and causal structure while preserving the original scene (Stage 3). Both base and upsampled prompts are fed into a diverse set of state-of-the-art text-to-video generation models, resulting in videos that probe tool interaction and physical commonsense (Stage 4).
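A minimal sketch of this four-stage construction, assuming generic LLM and text-to-video backends. The helpers `llm` and `generate_video` are placeholders introduced here for illustration; they are not part of PhysVidBench's released code.

```python
from dataclasses import dataclass

@dataclass
class PromptPair:
    base: str        # short, physically grounded prompt (Stage 2)
    upsampled: str   # enriched with physical and causal detail (Stage 3)

def llm(instruction: str, text: str) -> str:
    """Placeholder for a large language model call; a real pipeline would query an LLM."""
    return "yes" if "yes/no" in instruction else text  # stub behavior

def generate_video(model_name: str, prompt: str) -> str:
    """Placeholder for a text-to-video model call; returns a path to the rendered video."""
    return f"videos/{model_name}/{abs(hash(prompt)) % 10000}.mp4"  # stub

def build_benchmark(piqa_solutions, t2v_models):
    videos = []
    for solution in piqa_solutions:
        # Stage 1: keep only PIQA solutions involving secondary tool use / affordances.
        if llm("Does this solution involve secondary tool use? Answer yes/no.", solution) != "yes":
            continue
        # Stage 2: rewrite the solution as a short, physically grounded video prompt.
        base = llm("Rewrite as a short, physically grounded video prompt.", solution)
        # Stage 3: upsample the prompt with extra physical specificity, preserving the scene.
        upsampled = llm("Add physical and causal detail while preserving the scene.", base)
        pair = PromptPair(base, upsampled)
        # Stage 4: render both prompt variants with every T2V model under evaluation.
        for model in t2v_models:
            for prompt in (pair.base, pair.upsampled):
                videos.append((model, prompt, generate_video(model, prompt)))
    return videos
```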
Evaluation Pipeline. Overview of our three-stage evaluation framework designed to assess physical commonsense in generated videos. Using the upsampled prompts, we generate yes/no questions aligned with one or more of the seven physical commonsense dimensions in our ontology (Stage 1). Each generated video is captioned using AuroraCap, which produces a general-purpose caption and seven dimension-specific captions aimed at surfacing evidence relevant to different types of physical reasoning (Stage 2). An LLM is then prompted to answer the targeted physics questions using only these video captions (Stage 3).
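A rough sketch of Stages 2 and 3 of this evaluation, under the assumption that the judge LLM sees captions only and never the raw video. `caption_video` stands in for AuroraCap and `answer_from_captions` for the judge model; both are placeholder stubs, and the question records are assumed to come from Stage 1.

```python
DIMENSIONS = [
    "Action & Procedural Understanding", "Force and Motion", "Fundamental Physics",
    "Material Interaction & Transformation", "Object Properties & Affordances",
    "Spatial Reasoning", "Temporal Dynamics",
]

def caption_video(video_path: str, focus: str = "") -> str:
    """Placeholder for AuroraCap: a general caption, or one focused on a single dimension."""
    return f"[caption of {video_path}{', focusing on ' + focus if focus else ''}]"  # stub

def answer_from_captions(question: str, captions: list) -> str:
    """Placeholder for the judge LLM, which reads only captions, never pixels."""
    return "no"  # stub

def evaluate_video(video_path: str, questions: list) -> dict:
    # Stage 2: one general-purpose caption plus seven dimension-specific captions.
    captions = [caption_video(video_path)] + [caption_video(video_path, d) for d in DIMENSIONS]
    # Stage 3: answer each grounded yes/no question (from Stage 1) using only the captions.
    return {q["text"]: answer_from_captions(q["text"], captions) for q in questions}
```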
Model | AU | FM | FP | MT | OP | SR | TD | Avg
---|---|---|---|---|---|---|---|---
LTX-Video | 20.6 (-4.2) | 19.7 (-3.4) | 18.0 (-2.1) | 16.4 (-1.5) | 24.3 (-4.4) | 17.1 (-2.4) | 14.1 (-4.8) | 20.7 (-2.8) |
VideoCrafter2 | 17.7 (-1.6) | 19.3 (-2.8) | 22.6 (-2.8) | 18.3 (-2.8) | 26.4 (-2.6) | 15.7 (-0.2) | 12.3 (-1.9) | 22.0 (-1.8) |
CogVideoX (2B) | 22.2 (-1.5) | 24.4 (-4.3) | 23.3 (-2.8) | 22.5 (-0.4) | 29.2 (-4.0) | 19.5 (-1.3) | 16.7 (-1.5) | 25.6 (-3.0) |
CogVideoX (5B) | 18.7 (-0.8) | 19.4 (-3.1) | 20.7 (-3.2) | 17.8 (-1.5) | 25.6 (-1.8) | 17.9 (-0.4) | 12.6 (+0.8) | 20.8 (-1.5) |
Wan2.1 (1.3B) | 30.0 (-3.6) | 28.9 (-5.3) | 28.2 (-3.3) | 28.5 (-2.0) | 35.1 (-3.9) | 27.3 (-3.3) | 19.0 (-0.0) | 30.5 (-3.2) |
Wan2.1 (14B) | 33.3 (-3.4) | 31.4 (-1.7) | 31.9 (-1.4) | 32.4 (-3.3) | 39.1 (-4.5) | 29.8 (-1.5) | 23.4 (-1.5) | 33.9 (-1.9) |
MAGI-1 | 27.2 (-6.0) | 26.3 (-3.8) | 28.8 (-1.6) | 27.7 (-4.6) | 37.3 (-5.7) | 30.3 (-5.3) | 19.3 (-8.1) | 32.5 (-5.5) |
Hunyuan Video | 26.4 (-0.6) | 25.8 (+0.5) | 27.7 (-0.1) | 31.9 (-4.8) | 35.7 (-1.9) | 23.9 (+2.2) | 21.6 (-5.2) | 30.1 (-1.2) |
Cosmos (7B) | 32.1 (-8.1) | 32.4 (-9.1) | 32.9 (-8.6) | 35.8 (-11.0) | 39.6 (-10.5) | 31.7 (-10.1) | 21.6 (-8.2) | 34.8 (-8.3) |
Cosmos (14B) | 35.7 (-6.0) | 33.3 (-6.1) | 30.9 (-4.0) | 36.8 (-6.0) | 41.0 (-9.8) | 32.7 (-5.6) | 20.8 (-0.3) | 36.1 (-7.1) |
Sora | 28.6 (+2.4) | 30.0 (-1.1) | 29.9 (+0.2) | 33.1 (+0.3) | 37.0 (-1.1) | 24.8 (+1.7) | 16.7 (+2.6) | 31.4 (+0.3) |
Veo-2 | 34.9 (-0.7) | 33.8 (-2.3) | 31.7 (-2.7) | 36.0 (-3.1) | 38.4 (-1.5) | 29.7 (+1.0) | 19.5 (+2.5) | 34.8 (-0.7) |
AU: Action & Procedural Understanding, FM: Force and Motion, FP: Fundamental Physics, MT: Material Interaction & Transformation, OP: Object Properties & Affordances, SR: Spatial Reasoning, TD: Temporal Dynamics.
We present evaluation results for 12 state-of-the-art video generation models using PhysVidBench, which measures physical commonsense understanding across seven fine-grained dimensions (Action Understanding, Force & Motion, Fundamental Physics, Material Interaction, Object Properties, Spatial Reasoning, and Temporal Dynamics), along with an overall average. Each model is tested under two prompt variants: base prompts derived directly from PIQA answer texts, and upsampled prompts that enrich physical details and causal structure. Across all models, upsampled prompts consistently yield higher scores, particularly in dimensions that benefit from explicit physical context, such as Object Properties, Force & Motion, and Material Interaction. This suggests that prompt engineering remains a critical factor in eliciting stronger physical reasoning from generative models. Despite these gains, most models still underperform in Spatial Reasoning and Temporal Dynamics, reflecting limitations in current architectures' ability to encode geometry, continuity, and causal flow over time. Interestingly, while larger models such as Wan2.1 14B and Cosmos 14B generally achieve higher scores, size alone does not guarantee robust physical understanding, highlighting the need for improved objective functions, data, or inductive biases tailored to physical commonsense. Overall, PhysVidBench provides a clear and interpretable framework for diagnosing such gaps and benchmarking progress in physically grounded video generation.
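One way such per-dimension scores could be aggregated is sketched below, under the assumption that a dimension's score is the percentage of its yes/no questions answered "yes", computed separately for base and upsampled prompts. This is an illustrative assumption about the aggregation, not the benchmark's official metric definition; the record format is hypothetical.

```python
from collections import defaultdict

def aggregate_scores(records):
    """records: dicts like {"variant": "base"|"upsampled", "dimensions": [...], "answer": "yes"|"no"}."""
    counts = defaultdict(lambda: [0, 0])  # (variant, dimension) -> [yes_count, total_questions]
    for rec in records:
        for dim in rec["dimensions"]:
            key = (rec["variant"], dim)
            counts[key][0] += int(rec["answer"] == "yes")
            counts[key][1] += 1
    # Assumed score per dimension: percentage of questions answered "yes".
    return {key: 100.0 * yes / total for key, (yes, total) in counts.items()}

# Usage sketch: compare prompt variants on one dimension, e.g.
#   scores = aggregate_scores(records)
#   scores[("upsampled", "Spatial Reasoning")] vs. scores[("base", "Spatial Reasoning")]
```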