What is ViLMA?
ViLMA (Video Language Model Assessment) presents a comprehensive benchmark for Video-Language Models (VidLMs) to evaluate their capabilities in terms of five distinct tests: Counting, Situation Awareness, Change of State, Rare Actions, and Relations.
ViLMA also includes a two-stage evaluation procedure: (i) a proficiency test (P) that assesses fundamental capabilities deemed essential before solving the five tests, and (ii) a main test (T) that evaluates the model on the five proposed tests. A combined score of the two (P+T) is also reported.
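The relationship between the three scores can be sketched as follows. This is a minimal illustration, not the official evaluation code: it assumes each instance yields one pass/fail outcome for the proficiency test and one for the main test, and that the combined P+T score credits an instance only when both are passed. The names `InstanceResult` and `vilma_scores` are hypothetical; the evaluation scripts in the GitHub repository remain the reference.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Hypothetical per-instance outcome (not the official ViLMA format)."""
    proficiency_correct: bool  # outcome on the proficiency test (P)
    main_correct: bool         # outcome on the main test (T)


def vilma_scores(results):
    """Return P, T and P+T accuracies as fractions in [0, 1].

    Assumption: P+T counts an instance as correct only if the model
    passes both the proficiency test and the main test for it.
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    p = sum(r.proficiency_correct for r in results) / n
    t = sum(r.main_correct for r in results) / n
    pt = sum(r.proficiency_correct and r.main_correct for r in results) / n
    return {"P": p, "T": t, "P+T": pt}


# Toy example with four instances covering every pass/fail combination.
example = [
    InstanceResult(True, True),
    InstanceResult(True, False),
    InstanceResult(False, True),
    InstanceResult(False, False),
]
scores = vilma_scores(example)
print(scores)
```

Under this assumption the combined score can never exceed either individual score, which is why a model that solves the main test without passing the proficiency test gains no P+T credit.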
For more details about ViLMA, please refer to this paper:
The ViLMA benchmark encompasses a rich and diverse collection of visio-linguistic scenarios, with a total of 8460 instances across various tests and subtests. These scenarios have undergone meticulous scrutiny, resulting in 5177 valid samples, constituting approximately 61.19% of the dataset. ViLMA has five diverse subtests, each focusing on a different objective for evaluating VidLMs.
In the "Change of State" subtest, there are 998 valid samples, while "Action Counting" features 1432 valid instances. The "Rare Actions" subtest includes 1443 valid samples, and "Spatial Relations" has 393 valid instances. In the "Situation Awareness" subtest, there are 911 valid samples.
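As a quick consistency check on the figures above, the per-subtest counts can be summed and compared with the reported totals. The snippet uses only the numbers stated in this section; the variable names are illustrative.

```python
# Valid samples per subtest, as listed above.
VALID_SAMPLES = {
    "Change of State": 998,
    "Action Counting": 1432,
    "Rare Actions": 1443,
    "Spatial Relations": 393,
    "Situation Awareness": 911,
}
TOTAL_INSTANCES = 8460  # all instances before validation

total_valid = sum(VALID_SAMPLES.values())
valid_share = 100 * total_valid / TOTAL_INSTANCES

print(total_valid)            # 5177
print(f"{valid_share:.2f}%")  # 61.19%
```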
Please visit our GitHub repository to download the dataset:
To evaluate your model, please use the evaluation scripts shared in the GitHub repository.
To submit your model's results to the Leaderboard, please email Ilker Kesen with your model predictions, in the format specified in the GitHub repository, along with a concise model description.
If you use ViLMA in your research, please cite our paper:
ViLMA gathers data from a variety of existing datasets, each of which may come with its own licensing terms and conditions. For comprehensive information regarding the licensing of the data used in our benchmark, we strongly encourage users to refer to the individual datasets from which the data was sourced.