nielsr (HF Staff) committed
Commit 7feb8fe · verified · Parent: c1defad

Add metadata and paper/code links

This PR improves the model card by adding relevant metadata and links to the official research paper and code repository.

Specifically, it adds:
- `pipeline_tag: video-text-to-text` for better categorization.
- `library_name: transformers` to enable automated code snippets.
- Links to the paper [Factorized Learning for Temporally Grounded Video-Language Models](https://huggingface.co/papers/2512.24097) and the [GitHub repository](https://github.com/nusnlp/d2vlm).
- A BibTeX citation for the research paper.
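The added front matter is plain YAML between `---` fences. As a minimal sketch of how tooling might read those keys back (naive string parsing of simple `key: value` pairs, not a full YAML parser — the function name is illustrative, not part of any Hub API):

```python
def parse_front_matter(card_text: str) -> dict:
    # Naive sketch: assumes the card starts with a front-matter block
    # delimited by "---" lines containing flat "key: value" pairs.
    parts = card_text.split("---")
    if len(parts) < 3:
        return {}
    meta = {}
    for line in parts[1].strip().splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            meta[key.strip()] = value.strip()
    return meta

card = """---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

# D2VLM
"""
print(parse_front_matter(card))
```

The `pipeline_tag` and `library_name` keys are exactly what the Hub uses for categorization and automated snippets.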

Files changed (1)
  1. README.md +47 -34
README.md CHANGED
@@ -1,34 +1,47 @@
- ---
- license: apache-2.0
- ---
-
-
- # D2VLM Models
-
- Here we provided the pre-trained D2VLM models. The performance on the E.T. Bench is shown below.
-
- | Model Name | Referring (Acc) | Grounding (F1) | Dense Captioning (F1) | Dense Captioning (Sim) | Complex (Recall) |
- |---------------------|:---------------:|:--------------:|:---------------------:|:----------------------:|:----------------:|
- | D2VLM | 25.3 | 42.3 | 37.5 | 21.8 | 18.1 |
- | D2VLM_mcqa_enhanced | 38.3 | 44.3 | 37.2 | 21.4 | 18.6 |
-
-
- ## Some Notes
- 1. For the Referring tasks of E.T.Bench (RAR/EVC/RVQ), we adopt a more stringent evaluation protocol compared with the original E.T. Bench, which usually results in lower metric values (e.g., a drop of more than 10% for some existing methods when using our stringent metrics).
-
- 2. To enhance basic instruction-following capability, we incorporate
- automatically constructed multiple-choice questions during
- the proposed factorized preference optimization process.
- Due to our proposed factorized preference data synthesis,
- we can easily generate diverse distractor options based on
- different causes of failure and combine them with the original
- correct answer to form multiple-choice questions, without
- requiring additional external data sources. We define the resulting model as "D2VLM_mcqa_enhanced".
-
-
-
-
-
-
-
-
+ ---
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ ---
+
+ # D2VLM: Factorized Learning for Temporally Grounded Video-Language Models
+
+ This repository contains the pre-trained D2VLM models introduced in the paper [Factorized Learning for Temporally Grounded Video-Language Models](https://huggingface.co/papers/2512.24097), accepted at ICCV 2025.
+
+ D2VLM is a framework that decouples the learning of temporal grounding and textual response in video-language models while emphasizing their inherent dependency. It introduces a "grounding then answering with evidence referencing" paradigm and uses a Factorized Preference Optimization (FPO) algorithm to improve event-level perception.
+
+ - **Paper:** [Factorized Learning for Temporally Grounded Video-Language Models](https://huggingface.co/papers/2512.24097)
+ - **Code:** [GitHub Repository](https://github.com/nusnlp/d2vlm)
+
+ ## Performance
+
+ The performance on the E.T. Bench is shown below.
+
+ | Model Name | Referring (Acc) | Grounding (F1) | Dense Captioning (F1) | Dense Captioning (Sim) | Complex (Recall) |
+ |---------------------|:---------------:|:--------------:|:---------------------:|:----------------------:|:----------------:|
+ | D2VLM | 25.3 | 42.3 | 37.5 | 21.8 | 18.1 |
+ | D2VLM_mcqa_enhanced | 38.3 | 44.3 | 37.2 | 21.4 | 18.6 |
+
+ ## Some Notes
+ 1. For the Referring tasks of E.T.Bench (RAR/EVC/RVQ), we adopt a more stringent evaluation protocol compared with the original E.T. Bench, which usually results in lower metric values (e.g., a drop of more than 10% for some existing methods when using our stringent metrics).
+
+ 2. To enhance basic instruction-following capability, we incorporate automatically constructed multiple-choice questions during the proposed factorized preference optimization process. Due to our proposed factorized preference data synthesis, we can easily generate diverse distractor options based on different causes of failure and combine them with the original correct answer to form multiple-choice questions, without requiring additional external data sources. We define the resulting model as "D2VLM_mcqa_enhanced".
+
+ ## Citation
+
+ If you find our work useful in your research, please consider citing our paper:
+
+ ```bibtex
+ @inproceedings{d2vlm,
+   title={Factorized Learning for Temporally Grounded Video-Language Models},
+   author={Zeng, Wenzheng and Gao, Difei and Shou, Mike Zheng and Ng, Hwee Tou},
+   booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+   year={2025},
+   pages={20683-20693}
+ }
+ ```
+
+ ## Acknowledgments
+
+ This project was built upon [E.T. Bench](https://github.com/PolyU-ChenLab/ETBench), [TimeChat](https://github.com/RenShuhuai-Andy/TimeChat), and [AMP](https://github.com/takomc/amp).