ViLMA

Video Language Model Assessment
An overview of ViLMA. ViLMA begins with a proficiency test that assesses a model's fundamental comprehension skills, followed by a more advanced main test designed to gauge its specific temporal reasoning abilities.

What is ViLMA?

ViLMA (Video Language Model Assessment) is a comprehensive benchmark for Video-Language Models (VidLMs) that evaluates their capabilities across five distinct tests: Action Counting, Situation Awareness, Change of State, Rare Actions, and Spatial Relations.

ViLMA also follows a two-stage evaluation procedure: (i) a proficiency test (P) that assesses the fundamental capabilities deemed essential before tackling the five tests, and (ii) a main test (T) that evaluates the model on the five tests above. We also report a combined score over the two stages (P+T).
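To make the scoring concrete, below is a minimal sketch in Python (with hypothetical names such as `InstanceResult` and `vilma_scores`, not part of the ViLMA codebase) of one way to compute the three accuracies, under the assumption that the combined P+T score counts an instance as correct only when the model passes both its proficiency test and its main test:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceResult:
    """Outcome of the two-stage evaluation for a single instance (hypothetical fields)."""
    proficiency_correct: bool  # passed the proficiency test (P)?
    main_correct: bool         # passed the main test (T)?


def vilma_scores(results: List[InstanceResult]) -> Dict[str, float]:
    """Return P, T, and combined P+T accuracies in percent.

    P and T are plain accuracies on their respective stages; for P+T an
    instance only counts as correct if both stages are passed.
    """
    n = len(results)
    p = sum(r.proficiency_correct for r in results) / n
    t = sum(r.main_correct for r in results) / n
    pt = sum(r.proficiency_correct and r.main_correct for r in results) / n
    return {"P": 100 * p, "T": 100 * t, "P+T": 100 * pt}


# Toy example: 3 of 4 instances pass P, 2 pass T, and 2 pass both.
results = [
    InstanceResult(True, True),
    InstanceResult(True, True),
    InstanceResult(True, False),
    InstanceResult(False, False),
]
print(vilma_scores(results))  # {'P': 75.0, 'T': 50.0, 'P+T': 50.0}
```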

For more details about ViLMA, please refer to this paper:

Dataset

The ViLMA benchmark encompasses a rich and diverse collection of visio-linguistic scenarios, with a total of 8460 instances across its tests and subtests. These scenarios have undergone meticulous scrutiny, resulting in 5177 valid samples, approximately 61.19% of the dataset. ViLMA comprises five diverse subtests, each focusing on a different objective for evaluating VidLMs.

In the "Change of State" subtest, there are 998 valid samples, while "Action Counting" features 1432 valid instances. The "Rare Actions" subtest includes 1443 valid samples, and "Spatial Relations" has 393 valid instances. In the "Situation Awareness" subtest, there are 911 valid samples.
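As a quick sanity check on the statistics above, the per-subtest counts of valid samples sum to the reported total, and dividing by the 8460 collected instances reproduces the quoted percentage (a few lines of Python, purely illustrative):

```python
# Valid sample counts per subtest, as reported above.
valid_counts = {
    "Action Counting": 1432,
    "Rare Actions": 1443,
    "Change of State": 998,
    "Situation Awareness": 911,
    "Spatial Relations": 393,
}
total_valid = sum(valid_counts.values())   # 5177
share = 100 * total_valid / 8460           # share of the 8460 collected instances
print(total_valid, f"{share:.2f}%")        # 5177 61.19%
```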

Please visit our GitHub repository to download the dataset:

Evaluation

To evaluate your model, please use the evaluation scripts shared in the GitHub repository.

Submission

To submit your model's performance to the leaderboard, please send an email to Ilker Kesen containing your model predictions in the format specified in the GitHub repository, along with a concise model description.

Citation

If you use ViLMA in your research, please cite our paper:

TODO

Licence

ViLMA gathers data from a variety of source datasets, each of which may come with its own licensing terms and conditions. For comprehensive information about the licensing of the data used in our benchmark, we strongly encourage users to refer to the individual datasets from which the data was sourced.

Leaderboard (Overall)

| Rank | Submission Date | Model Type | Model | P (%) | T (%) | P+T (%) |
|------|-----------------|------------|-------|-------|-------|---------|
| 1 | Oct 21, 2023 | ILM | CLIP (Radford et al., 2021) | 87.27 | 63.33 | 56.64 |
| 2 | Oct 21, 2023 | VLM | CLIP4Clip (Luo et al., 2022) | 85.68 | 64.06 | 55.22 |
| 3 | Oct 21, 2023 | ILM | BLIP2 (Li et al., 2023) | 82.73 | 65.68 | 54.92 |
| 4 | Oct 21, 2023 | VLM | VindLU (Cheng et al., 2023) | 84.82 | 61.03 | 53.83 |
| 5 | Oct 21, 2023 | VLM | Singularity (Lei et al., 2022) | 83.97 | 60.18 | 52.60 |
| 6 | Oct 21, 2023 | VLM | FiT (Bain et al., 2021) | 83.77 | 60.45 | 52.13 |
| 7 | Oct 21, 2023 | VLM | Merlot Reserve (Zellers et al., 2021) | 81.84 | 57.74 | 50.99 |
| 8 | Oct 21, 2023 | VLM | MCQ (Ge et al., 2022) | 83.21 | 58.60 | 50.77 |
| 9 | Oct 21, 2023 | VLM | XCLIP (Ma et al., 2022) | 80.02 | 61.44 | 50.62 |
| 10 | Oct 21, 2023 | VLM | VIOLET (Fu et al., 2021) | 81.22 | 59.81 | 49.02 |
| 11 | Oct 21, 2023 | VLM | VideoCLIP (Xu et al., 2021) | 70.90 | 55.56 | 41.30 |
| 12 | Oct 21, 2023 | VLM | UniVL (Luo et al., 2020) | 71.63 | 56.41 | 40.66 |
| 13 | Oct 21, 2023 | VLM | UniPerceiver (Zhu et al., 2022) | 55.74 | 49.15 | 26.89 |
| 14 | Oct 21, 2023 | VLM | ClipBERT (Lei et al., 2021) | 51.81 | 49.41 | 26.04 |
| 15 | Oct 21, 2023 | LM | GPT-2 (Radford et al., 2019) | 45.22 | 49.30 | 23.49 |
| 16 | Oct 21, 2023 | LM | OPT (Zhang et al., 2022) | 51.29 | 41.65 | 22.22 |
Leaderboard (Action Counting)

| Rank | Submission Date | Model Type | Model | P (%) | T (%) | P+T (%) |
|------|-----------------|------------|-------|-------|-------|---------|
| 1 | Oct 21, 2023 | VLM | CLIP4Clip (Luo et al., 2022) | 91.20 | 52.30 | 47.97 |
| 2 | Oct 21, 2023 | VLM | Merlot Reserve (Zellers et al., 2021) | 84.15 | 56.01 | 46.86 |
| 3 | Oct 21, 2023 | VLM | XCLIP (Ma et al., 2022) | 84.10 | 55.10 | 46.40 |
| 4 | Oct 21, 2023 | ILM | CLIP (Radford et al., 2021) | 90.50 | 50.90 | 46.20 |
| 5 | Oct 21, 2023 | VLM | FiT (Bain et al., 2021) | 83.90 | 52.40 | 44.60 |
| 6 | Oct 21, 2023 | ILM | BLIP2 (Li et al., 2023) | 80.90 | 54.50 | 43.70 |
| 7 | Oct 21, 2023 | VLM | VindLU (Cheng et al., 2023) | 84.50 | 51.20 | 43.50 |
| 8 | Oct 21, 2023 | VLM | Singularity (Lei et al., 2022) | 79.60 | 51.10 | 41.50 |
| 8 | Oct 21, 2023 | VLM | MCQ (Ge et al., 2022) | 81.40 | 50.40 | 41.50 |
| 10 | Oct 21, 2023 | VLM | VIOLET (Fu et al., 2021) | 79.60 | 50.60 | 36.50 |
| 10 | Oct 21, 2023 | VLM | VideoCLIP (Xu et al., 2021) | 79.10 | 46.40 | 36.50 |
| 12 | Oct 21, 2023 | VLM | UniVL (Luo et al., 2020) | 73.40 | 43.60 | 32.20 |
| 13 | Oct 21, 2023 | LM | OPT (Zhang et al., 2022) | 56.20 | 54.60 | 31.00 |
| 14 | Oct 21, 2023 | VLM | ClipBERT (Lei et al., 2021) | 56.40 | 49.60 | 28.00 |
| 15 | Oct 21, 2023 | LM | GPT-2 (Radford et al., 2019) | 50.30 | 53.30 | 27.60 |
| 16 | Oct 21, 2023 | VLM | UniPerceiver (Zhu et al., 2022) | 50.56 | 46.37 | 22.97 |
Leaderboard (Situation Awareness)

| Rank | Submission Date | Model Type | Model | P (%) | T (%) | P+T (%) |
|------|-----------------|------------|-------|-------|-------|---------|
| 1 | Oct 21, 2023 | ILM | BLIP2 (Li et al., 2023) | 73.43 | 75.41 | 55.76 |
| 2 | Oct 21, 2023 | VLM | CLIP4Clip (Luo et al., 2022) | 73.87 | 49.06 | 37.65 |
| 3 | Oct 21, 2023 | ILM | CLIP (Radford et al., 2021) | 71.02 | 45.55 | 33.69 |
| 4 | Oct 21, 2023 | VLM | VIOLET (Fu et al., 2021) | 70.25 | 41.60 | 32.49 |
| 5 | Oct 21, 2023 | VLM | ClipBERT (Lei et al., 2021) | 54.12 | 56.97 | 31.94 |
| 6 | Oct 21, 2023 | LM | GPT-2 (Radford et al., 2019) | 44.56 | 66.63 | 31.72 |
| 7 | Oct 21, 2023 | VLM | VindLU (Cheng et al., 2023) | 70.58 | 41.60 | 31.28 |
| 8 | Oct 21, 2023 | VLM | XCLIP (Ma et al., 2022) | 63.55 | 44.89 | 31.06 |
| 9 | Oct 21, 2023 | VLM | Singularity (Lei et al., 2022) | 68.82 | 40.94 | 30.18 |
| 10 | Oct 21, 2023 | VLM | FiT (Bain et al., 2021) | 69.81 | 40.06 | 29.19 |
| 11 | Oct 21, 2023 | VLM | MCQ (Ge et al., 2022) | 67.06 | 37.10 | 26.34 |
| 12 | Oct 21, 2023 | VLM | Merlot Reserve (Zellers et al., 2021) | 70.58 | 35.67 | 25.35 |
| 13 | Oct 21, 2023 | VLM | VideoCLIP (Xu et al., 2021) | 61.69 | 40.39 | 24.91 |
| 14 | Oct 21, 2023 | VLM | UniVL (Luo et al., 2020) | 52.90 | 46.65 | 24.14 |
| 15 | Oct 21, 2023 | VLM | UniPerceiver (Zhu et al., 2022) | 51.48 | 42.15 | 21.18 |
| 16 | Oct 21, 2023 | LM | OPT (Zhang et al., 2022) | 58.95 | 23.85 | 15.80 |
Leaderboard (Change of State)

| Rank | Submission Date | Model Type | Model | P (%) | T (%) | P+T (%) |
|------|-----------------|------------|-------|-------|-------|---------|
| 1 | Oct 21, 2023 | ILM | CLIP (Radford et al., 2021) | 93.00 | 55.20 | 52.20 |
| 2 | Oct 21, 2023 | VLM | CLIP4Clip (Luo et al., 2022) | 94.80 | 94.80 | 52.10 |
| 3 | Oct 21, 2023 | VLM | Merlot Reserve (Zellers et al., 2021) | 93.40 | 53.60 | 50.40 |
| 4 | Oct 21, 2023 | VLM | Singularity (Lei et al., 2022) | 92.80 | 54.60 | 50.30 |
| 5 | Oct 21, 2023 | VLM | VIOLET (Fu et al., 2021) | 88.20 | 54.60 | 49.10 |
| 6 | Oct 21, 2023 | VLM | FiT (Bain et al., 2021) | 93.00 | 52.10 | 47.80 |
| 7 | Oct 21, 2023 | VLM | XCLIP (Ma et al., 2022) | 85.70 | 52.70 | 46.00 |
| 8 | Oct 21, 2023 | VLM | VindLU (Cheng et al., 2023) | 85.40 | 52.60 | 45.60 |
| 9 | Oct 21, 2023 | VLM | MCQ (Ge et al., 2022) | 90.30 | 50.30 | 45.30 |
| 10 | Oct 21, 2023 | VLM | UniVL (Luo et al., 2020) | 81.30 | 54.30 | 43.00 |
| 11 | Oct 21, 2023 | ILM | BLIP2 (Li et al., 2023) | 74.50 | 52.10 | 38.10 |
| 12 | Oct 21, 2023 | VLM | ClipBERT (Lei et al., 2021) | 63.70 | 50.00 | 33.50 |
| 13 | Oct 21, 2023 | VLM | UniPerceiver (Zhu et al., 2022) | 67.50 | 46.10 | 29.10 |
| 14 | Oct 21, 2023 | VLM | VideoCLIP (Xu et al., 2021) | 49.80 | 50.80 | 25.90 |
| 15 | Oct 21, 2023 | LM | OPT (Zhang et al., 2022) | 23.10 | 48.00 | 12.90 |
| 16 | Oct 21, 2023 | LM | GPT-2 (Radford et al., 2019) | 18.00 | 52.40 | 10.80 |
Leaderboard (Rare Actions)

| Rank | Submission Date | Model Type | Model | P (%) | T (%) | P+T (%) |
|------|-----------------|------------|-------|-------|-------|---------|
| 1 | Oct 21, 2023 | VLM | VindLU (Cheng et al., 2023) | 94.18 | 93.07 | 87.94 |
| 2 | Oct 21, 2023 | ILM | CLIP (Radford et al., 2021) | 92.72 | 93.90 | 87.80 |
| 3 | Oct 21, 2023 | VLM | Singularity (Lei et al., 2022) | 92.65 | 88.84 | 83.09 |
| 4 | Oct 21, 2023 | VLM | MCQ (Ge et al., 2022) | 91.34 | 88.70 | 82.26 |
| 5 | Oct 21, 2023 | VLM | FiT (Bain et al., 2021) | 89.67 | 89.40 | 80.73 |
| 6 | Oct 21, 2023 | VLM | CLIP4Clip (Luo et al., 2022) | 82.95 | 94.10 | 78.65 |
| 7 | Oct 21, 2023 | VLM | Merlot Reserve (Zellers et al., 2021) | 83.78 | 80.57 | 77.61 |
| 8 | Oct 21, 2023 | VLM | VIOLET (Fu et al., 2021) | 87.10 | 86.63 | 74.64 |
| 9 | Oct 21, 2023 | VLM | XCLIP (Ma et al., 2022) | 83.85 | 85.65 | 72.28 |
| 10 | Oct 21, 2023 | ILM | BLIP2 (Li et al., 2023) | 93.83 | 74.50 | 70.48 |
| 11 | Oct 21, 2023 | VLM | VideoCLIP (Xu et al., 2021) | 83.99 | 77.75 | 67.50 |
| 12 | Oct 21, 2023 | VLM | UniVL (Luo et al., 2020) | 77.48 | 78.04 | 59.89 |
| 13 | Oct 21, 2023 | VLM | UniPerceiver (Zhu et al., 2022) | 58.21 | 58.76 | 34.71 |
| 14 | Oct 21, 2023 | LM | GPT-2 (Radford et al., 2019) | 58.35 | 25.85 | 17.67 |
| 15 | Oct 21, 2023 | LM | OPT (Zhang et al., 2022) | 58.97 | 23.91 | 14.90 |
| 16 | Oct 21, 2023 | VLM | ClipBERT (Lei et al., 2021) | 39.71 | 39.78 | 14.14 |
Leaderboard (Spatial Relations)

| Rank | Submission Date | Model Type | Model | P (%) | T (%) | P+T (%) |
|------|-----------------|------------|-------|-------|-------|---------|
| 1 | Oct 21, 2023 | ILM | BLIP2 (Li et al., 2023) | 91.10 | 86.00 | 79.40 |
| 2 | Oct 21, 2023 | LM | OPT (Zhang et al., 2022) | 59.00 | 84.70 | 55.70 |
| 3 | Oct 21, 2023 | ILM | CLIP (Radford et al., 2021) | 78.60 | 58.30 | 44.80 |
| 4 | Oct 21, 2023 | VLM | CLIP4Clip (Luo et al., 2022) | 79.80 | 56.70 | 44.20 |
| 5 | Oct 21, 2023 | VLM | XCLIP (Ma et al., 2022) | 74.80 | 56.20 | 43.50 |
| 6 | Oct 21, 2023 | LM | GPT-2 (Radford et al., 2019) | 49.10 | 72.80 | 43.00 |
| 7 | Oct 21, 2023 | VLM | VideoCLIP (Xu et al., 2021) | 67.90 | 54.70 | 39.70 |
| 8 | Oct 21, 2023 | VLM | VindLU (Cheng et al., 2023) | 83.20 | 45.60 | 39.40 |
| 8 | Oct 21, 2023 | VLM | MCQ (Ge et al., 2022) | 79.40 | 48.90 | 39.40 |
| 10 | Oct 21, 2023 | VLM | Singularity (Lei et al., 2022) | 80.70 | 46.80 | 38.90 |
| 11 | Oct 21, 2023 | VLM | VIOLET (Fu et al., 2021) | 73.30 | 50.40 | 38.70 |
| 11 | Oct 21, 2023 | VLM | FiT (Bain et al., 2021) | 70.50 | 51.90 | 38.70 |
| 13 | Oct 21, 2023 | VLM | UniVL (Luo et al., 2020) | 62.50 | 51.70 | 33.20 |
| 14 | Oct 21, 2023 | VLM | ClipBERT (Lei et al., 2021) | 44.00 | 65.10 | 30.00 |
| 15 | Oct 21, 2023 | VLM | Merlot Reserve (Zellers et al., 2021) | 63.10 | 41.90 | 29.20 |
| 16 | Oct 21, 2023 | VLM | UniPerceiver (Zhu et al., 2022) | 45.50 | 48.00 | 20.10 |