nielsr (HF Staff) committed
Commit 00e7af3 · verified · 1 parent: e984da2

Improve model card and add metadata for TPRU-7B


Hi there! I'm Niels from the Hugging Face community science team. I've opened this PR to improve the model card for TPRU-7B.

Specifically, I have:
- Added metadata including the `image-text-to-text` pipeline tag and the `transformers` library name.
- Linked the model to its base model (`Qwen/Qwen2.5-VL-7B-Instruct`) and relevant datasets (`TPRU-25k`, `TPRU-test`).
- Added a descriptive summary of the model's purpose, key tasks (Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review), and its performance highlights as described in the paper.
- Included the BibTeX citation for the ICLR 2026 paper.

These changes will make the model more discoverable and easier to cite for the community. Let me know if you'd like any adjustments!

Files changed (1)
  1. README.md (+55 −3)
README.md CHANGED
@@ -1,3 +1,55 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ base_model: Qwen/Qwen2.5-VL-7B-Instruct
+ datasets:
+ - Stephengzk/TPRU-25k
+ - Stephengzk/TPRU-test
+ tags:
+ - multimodal
+ - temporal-reasoning
+ - procedural-understanding
+ - vision
+ - RL
+ ---
+
+ # TPRU-7B
+
+ TPRU-7B is a multimodal large language model fine-tuned to enhance temporal and procedural visual understanding. It is based on the [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) architecture and optimized with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), on the TPRU dataset.
+
+ The model was introduced in the paper [TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models](https://huggingface.co/papers/2602.18884), which was accepted to ICLR 2026.
+
+ ## Model Description
+
+ Multimodal Large Language Models (MLLMs) often struggle with understanding temporal and procedural visual data, which is a bottleneck for real-world embodied AI applications. TPRU (Temporal-Procedural Understanding) addresses this gap by training models on large-scale, procedurally coherent data.
+
+ TPRU-7B is trained to excel at three core temporal reasoning tasks:
+ 1. **Temporal Reordering:** Reconstructing the correct sequence of shuffled frames.
+ 2. **Next-Frame Prediction:** Predicting the immediate future state given a sequence.
+ 3. **Previous-Frame Review:** Deducing the prerequisite state given an outcome.
+
+ Experiments show that TPRU-7B significantly outperforms larger proprietary models such as GPT-4o on procedural-understanding benchmarks including MuirBench and LEGO-Puzzles.
+
+ ## Resources
+
+ - **Paper:** [TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models](https://huggingface.co/papers/2602.18884)
+ - **Repository:** [GitHub - Stephen-gzk/TPRU](https://github.com/Stephen-gzk/TPRU/)
+ - **Datasets:** [TPRU-25k](https://huggingface.co/datasets/Stephengzk/TPRU-25k), [TPRU-test](https://huggingface.co/datasets/Stephengzk/TPRU-test)
+
+ ## Citation
+
+ If you find this model or the TPRU dataset useful for your research, please cite the following paper:
+
+ ```bibtex
+ @inproceedings{gao2026tpru,
+   title={TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models},
+   author={Gao, Zhenkun and Wang, Xuhong and Tan, Xin and Xie, Yuan},
+   booktitle={International Conference on Learning Representations (ICLR)},
+   year={2026}
+ }
+ ```
+
+ ## Acknowledgements
+
+ The authors thank the developers of [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [Easy-R1](https://github.com/hiyouga/EasyR1), and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for their open-source contributions.
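
The three tasks described in the model card are all frame-sequence prompts. As a minimal sketch, assuming TPRU-7B follows the standard Qwen2.5-VL interleaved image/text chat format in `transformers`, a temporal-reordering query could be assembled like this. The prompt wording, file names, and the commented repo id are illustrative assumptions, not taken from the paper or model card:

```python
# Sketch: assemble a temporal-reordering query in the Qwen2.5-VL chat format
# (a user turn with interleaved image entries followed by a text instruction).
# All concrete names below are illustrative assumptions.

def build_reorder_messages(frame_paths, instruction):
    """Build a single-turn chat message: shuffled frames, then the instruction."""
    content = [{"type": "image", "image": path} for path in frame_paths]
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

messages = build_reorder_messages(
    ["frame_2.jpg", "frame_0.jpg", "frame_1.jpg"],  # shuffled order (hypothetical files)
    "These frames are shuffled. Reply with the correct temporal order.",
)

# With the model weights available, this payload would feed the usual pipeline:
#   processor = AutoProcessor.from_pretrained(model_id)  # model_id = the TPRU-7B repo
#   text = processor.apply_chat_template(messages, add_generation_prompt=True)
#   ... then Qwen2_5_VLForConditionalGeneration.generate(...)
print(len(messages[0]["content"]))  # 3 image entries + 1 text entry -> 4
```

The same message-building pattern covers Next-Frame Prediction and Previous-Frame Review by changing the instruction text and which frames are supplied.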