nielsr (HF Staff) committed (verified)
Commit: 2982605 Β· Parent(s): ddc309e

Update model card with metadata and code link


Hi! I'm Niels from the community science team at Hugging Face.

This PR improves the model card for VideoLoom by:
- Adding the `video-text-to-text` pipeline tag for better categorization.
- Adding `library_name: transformers`, since the model can be loaded through its `auto_map` entries (a minimal loading sketch follows this list).
- Including the official GitHub repository link and a link to the paper.
- Adding a model zoo table and citation information from the official repository.
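
As a quick illustration of the `auto_map` point above, loading should reduce to the standard `trust_remote_code` path. This is a minimal sketch, not taken from the official repo; the repo id comes from the model zoo table added below, and which `Auto` classes VideoLoom maps is an assumption:

```python
from transformers import AutoModel, AutoTokenizer

# Minimal loading sketch. Assumption: the checkpoint's config declares
# `auto_map` entries for its custom classes, so `trust_remote_code=True`
# lets AutoModel/AutoTokenizer resolve and import the remote model code.
repo_id = "JPShi/VideoLoom-8B"  # id taken from the model zoo table below
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
```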

Files changed (1): README.md (+38, -2)

@@ -1,15 +1,51 @@
---
license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
---

# VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

Jiapeng Shi, [Junke Wang](https://wdrink.github.io/), [Zuyao You](https://scholar.google.com/citations?hl=en&user=X8Kh8uoAAAAJ), [Bo He](https://boheumd.github.io/), [Zuxuan Wu<sup>&#9993;</sup>](https://zxwu.azurewebsites.net/)

- [\[πŸ“œ Paper\]](https://arxiv.org/abs/2601.07290) [\[πŸ“₯ Model\]](https://huggingface.co/collections/JPShi/videoloom)
+ [\[πŸ“œ Paper\]](https://arxiv.org/abs/2601.07290) [\[πŸ’» Code\]](https://github.com/JPShi12/VideoLoom) [\[πŸ“₯ Model\]](https://huggingface.co/collections/JPShi/videoloom)

## πŸ”Ž Overview

This paper presents **VideoLoom**, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate **LoomData-8.7k**, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce **LoomBench**, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.

- ![Model](assets/model.jpg)
+ ![Model](assets/model.jpg)
+
+ ## πŸ”₯ News
+
+ * `Jan. 13, 2026`: Our paper and checkpoints are released.
+
+ ## πŸ“¦ Model Zoo
+
+ We provide the following models:
+ | Model Name | Base MLLM | Checkpoints |
+ |:----------:|:---------:|:-----------:|
+ | VideoLoom-4B | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [πŸ€— link](https://huggingface.co/JPShi/VideoLoom-4B) |
+ | VideoLoom-8B | [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) | [πŸ€— link](https://huggingface.co/JPShi/VideoLoom-8B) |
+
+ ## πŸ“œ Citation
+
+ If you find our work helpful, please consider leaving a star ⭐ and a citation πŸ“:
+
+ ```bibtex
+ @article{shi2026videoloom,
+   title={VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding},
+   author={Shi, Jiapeng and Wang, Junke and You, Zuyao and He, Bo and Wu, Zuxuan},
+   journal={arXiv preprint arXiv:2601.07290},
+   year={2026}
+ }
+ ```
+
+ ## πŸ“§ Contact
+ Feel free to contact us if you have any questions or suggestions:
+
+ - Email (Jiapeng Shi): jpshi1212@gmail.com
+
+ ## 🀝 Acknowledgements
+
+ Our codebase builds on [Sa2VA](https://github.com/bytedance/Sa2VA) and [TimeChat](https://github.com/RenShuhuai-Andy/TimeChat). Thanks for their wonderful projects.
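
Both checkpoints in the model zoo build on InternVL-family base MLLMs, so inference plausibly follows InternVL's remote-code `chat` interface. The sketch below assumes exactly that; the `chat` signature, the `<video>` placeholder, and the frame preprocessing are assumptions carried over from the InternVL model cards, not a confirmed VideoLoom API (the official GitHub repo is the authoritative reference):

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "JPShi/VideoLoom-8B"  # checkpoint from the model zoo above
model = AutoModel.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

# Placeholder for preprocessed video frames: InternVL-style models consume
# 448x448 patches stacked along the first axis. Real frame sampling and
# normalization should come from VideoLoom's own preprocessing code.
pixel_values = torch.randn(8, 3, 448, 448, dtype=torch.bfloat16).cuda()

question = "<video>\nWhen does the person pick up the cup, and where is it?"

# Assumed InternVL-style chat call; VideoLoom's remote code may expose a
# different method or extra arguments for grounded spatial/temporal outputs.
response = model.chat(
    tokenizer, pixel_values, question,
    generation_config=dict(max_new_tokens=256, do_sample=False),
)
print(response)
```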