Improve model card with pipeline tag and library name
#1
by nielsr (HF Staff) - opened

README.md CHANGED

---
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---


<div align="center">

# HaploVL - A Single-Transformer Baseline for Multi-Modal Understanding

[Paper](http://arxiv.org/abs/2503.14694) | [Project Page](https://haplo-vl.github.io/) | [Model Collection](https://huggingface.co/collections/stevengrove/haplo-67d2582ac79d96983fa99697)

</div>

HaploVL is a multimodal understanding foundation model that delivers comprehensive cross-modal understanding capabilities for text, images, and video inputs through a single transformer architecture. The model was presented in the paper [HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation](https://huggingface.co/papers/2506.02975).
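To make the "single transformer" idea concrete, here is a minimal, hypothetical early-fusion sketch (illustrative only, not HaploVL's actual implementation): image patch embeddings and text token embeddings are projected into one sequence and processed by a single shared transformer trunk, with no separate vision-encoder stack.

```python
import torch
import torch.nn as nn

# Illustrative early-fusion sketch (NOT HaploVL's real code): image patches and
# text tokens share one transformer from the first layer onward.
class SingleTransformerVL(nn.Module):
    def __init__(self, dim=256, vocab=1000, patch_dim=768, layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.patch_proj = nn.Linear(patch_dim, dim)  # light projection, no vision tower
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, patches, token_ids):
        # Concatenate both modalities into one sequence; one trunk attends over all of it.
        seq = torch.cat([self.patch_proj(patches), self.text_embed(token_ids)], dim=1)
        return self.lm_head(self.trunk(seq))

model = SingleTransformerVL()
patches = torch.randn(1, 16, 768)         # 16 image patch embeddings
tokens = torch.randint(0, 1000, (1, 8))   # 8 text token ids
logits = model(patches, tokens)
print(logits.shape)  # torch.Size([1, 24, 1000])
```

The point of the design is that cross-modal interaction happens in every layer, rather than only after a frozen vision encoder.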

## Highlights
This repository contains the PyTorch implementation, model weights, and training code for **Haplo**.

Basic usage example:

```python
import torch
from haplo import HaploProcessor, HaploForConditionalGeneration

processor = HaploProcessor.from_pretrained('stevengrove/Haplo-7B-Pro')
model = HaploForConditionalGeneration.from_pretrained(
    'stevengrove/Haplo-7B-Pro',
    torch_dtype=torch.bfloat16
).to('cuda')

# ... construct multimodal `inputs` with the processor ...
outputs = model.generate(inputs)
print(processor.decode(outputs[0]))
```
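As a rough sizing note (a back-of-envelope estimate, not an official requirement from this card): loading 7B parameters in bfloat16 takes about 13 GiB of GPU memory for the weights alone, before activations and the KV cache.

```python
# Back-of-envelope GPU memory for the weights of a 7B-parameter model in bfloat16.
# Illustrative only; actual runtime usage adds activations and the KV cache.
params = 7e9          # ~7 billion parameters
bytes_per_param = 2   # bfloat16 stores each parameter in 16 bits
weights_gib = params * bytes_per_param / 1024**3
print(f"{weights_gib:.1f} GiB")  # 13.0 GiB
```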

### Gradio Demo
Launch an interactive demo:
```bash
python demo/demo.py \
    -m "stevengrove/Haplo-7B-Pro-Video" \
    --server-port 8080 \
    --device cuda \
    --dtype bfloat16
```

**Multi-Modal Capabilities**

| Category                   | Example         |
|----------------------------|-----------------|
| Single Image Understanding | (example image) |
| Multi-Image Understanding  | (example image) |
| Video Understanding        | (example image) |

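For video understanding, multimodal models typically reduce a clip to a fixed number of sampled frames before encoding. A small frame-sampling helper (a hypothetical illustration, not part of the Haplo API) might look like:

```python
def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick `num_samples` uniformly spaced frame indices from a video.

    Hypothetical helper for illustration; each index is taken from the
    middle of its segment so samples cover the whole clip evenly.
    """
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

print(sample_frame_indices(100, 8))  # [6, 18, 31, 43, 56, 68, 81, 93]
```

The sampled frames would then be passed to the processor like any other batch of images.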
## Acknowledgement

```bibtex
@article{HaploVL,
  title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
  author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2503.14694},
  year={2025}
}
```