nielsr (HF Staff) committed
Commit 00e7af3 · verified · 1 parent: e984da2

Improve model card and add metadata for TPRU-7B


Hi there! I'm Niels from the Hugging Face community science team. I've opened this PR to improve the model card for TPRU-7B.

Specifically, I have:
- Added metadata including the `image-text-to-text` pipeline tag and the `transformers` library name.
- Linked the model to its base model (`Qwen/Qwen2.5-VL-7B-Instruct`) and relevant datasets (`TPRU-25k`, `TPRU-test`).
- Added a descriptive summary of the model's purpose, key tasks (Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review), and its performance highlights as described in the paper.
- Included the BibTeX citation for the ICLR 2026 paper.

These changes will make the model more discoverable and easier to cite for the community. Let me know if you'd like any adjustments!

Files changed (1)
  1. README.md (+55 −3)
README.md CHANGED
@@ -1,3 +1,55 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ base_model: Qwen/Qwen2.5-VL-7B-Instruct
+ datasets:
+ - Stephengzk/TPRU-25k
+ - Stephengzk/TPRU-test
+ tags:
+ - multimodal
+ - temporal-reasoning
+ - procedural-understanding
+ - vision
+ - RL
+ ---
+
+ # TPRU-7B
+
+ TPRU-7B is a multimodal large language model fine-tuned to enhance temporal and procedural visual understanding. It is based on the [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) architecture and optimized with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), on the TPRU dataset.
+
+ The model was introduced in the paper [TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models](https://huggingface.co/papers/2602.18884), which was accepted to ICLR 2026.
+
+ ## Model Description
+
+ Multimodal Large Language Models (MLLMs) often struggle with understanding temporal and procedural visual data, which is a bottleneck for real-world embodied AI applications. TPRU (Temporal-Procedural Understanding) addresses this gap by training models on large-scale, procedurally coherent data.
+
+ TPRU-7B is trained to excel at three core temporal reasoning tasks:
+ 1. **Temporal Reordering:** Reconstructing the correct sequence of shuffled frames.
+ 2. **Next-Frame Prediction:** Predicting the immediate future state given a sequence.
+ 3. **Previous-Frame Review:** Deducing the prerequisite state given an outcome.
+
+ Experiments show that TPRU-7B significantly outperforms larger proprietary models such as GPT-4o on procedural-understanding benchmarks including MuirBench and LEGO-Puzzles.
+
+ ## Resources
+
+ - **Paper:** [TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models](https://huggingface.co/papers/2602.18884)
+ - **Repository:** [GitHub - Stephen-gzk/TPRU](https://github.com/Stephen-gzk/TPRU/)
+ - **Datasets:** [TPRU-25k](https://huggingface.co/datasets/Stephengzk/TPRU-25k), [TPRU-test](https://huggingface.co/datasets/Stephengzk/TPRU-test)
+
+ ## Citation
+
+ If you find this model or the TPRU dataset useful for your research, please cite the following paper:
+
+ ```bibtex
+ @inproceedings{gao2026tpru,
+   title={TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models},
+   author={Gao, Zhenkun and Wang, Xuhong and Tan, Xin and Xie, Yuan},
+   booktitle={International Conference on Learning Representations (ICLR)},
+   year={2026}
+ }
+ ```
+
+ ## Acknowledgements
+
+ The authors thank the developers of [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [Easy-R1](https://github.com/hiyouga/EasyR1), and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for their open-source contributions.
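
The three tasks described in the model card are all frame-sequence prompts. As a minimal sketch, assuming TPRU-7B follows the standard Qwen2.5-VL interleaved image/text chat format in `transformers`, a temporal-reordering query could be assembled like this. The prompt wording, file names, and the commented repo id are illustrative assumptions, not taken from the paper or model card:

```python
# Sketch: assemble a temporal-reordering query in the Qwen2.5-VL chat format
# (a user turn with interleaved image entries followed by a text instruction).
# All concrete names below are illustrative assumptions.

def build_reorder_messages(frame_paths, instruction):
    """Build a single-turn chat message: shuffled frames, then the instruction."""
    content = [{"type": "image", "image": path} for path in frame_paths]
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

messages = build_reorder_messages(
    ["frame_2.jpg", "frame_0.jpg", "frame_1.jpg"],  # shuffled order (hypothetical files)
    "These frames are shuffled. Reply with the correct temporal order.",
)

# With the model weights available, this payload would feed the usual pipeline:
#   processor = AutoProcessor.from_pretrained(model_id)  # model_id = the TPRU-7B repo
#   text = processor.apply_chat_template(messages, add_generation_prompt=True)
#   ... then Qwen2_5_VLForConditionalGeneration.generate(...)
print(len(messages[0]["content"]))  # 3 image entries + 1 text entry -> 4
```

The same message-building pattern covers Next-Frame Prediction and Previous-Frame Review by changing the instruction text and which frames are supplied.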