Improve model card with pipeline tag and library name
#1
by nielsr (HF Staff) - opened

README.md CHANGED

---
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---


<div align="center">

# HaploVL - A Single-Transformer Baseline for Multi-Modal Understanding

[Paper](http://arxiv.org/abs/2503.14694) | [Project Page](https://haplo-vl.github.io/) | [Model Collection](https://huggingface.co/collections/stevengrove/haplo-67d2582ac79d96983fa99697)

</div>

HaploVL is a multimodal understanding foundation model that delivers comprehensive cross-modal understanding capabilities for text, images, and video inputs through a single transformer architecture. The model was presented in the paper [HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation](https://huggingface.co/papers/2506.02975).
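To make the "single transformer" idea concrete, here is a minimal, hypothetical early-fusion sketch (illustrative only, not HaploVL's actual implementation): image patch embeddings and text token embeddings are projected into one sequence and processed by a single shared transformer trunk, with no separate vision-encoder stack.

```python
import torch
import torch.nn as nn

# Illustrative early-fusion sketch (NOT HaploVL's real code): image patches and
# text tokens share one transformer from the first layer onward.
class SingleTransformerVL(nn.Module):
    def __init__(self, dim=256, vocab=1000, patch_dim=768, layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.patch_proj = nn.Linear(patch_dim, dim)  # light projection, no vision tower
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, patches, token_ids):
        # Concatenate both modalities into one sequence; one trunk attends over all of it.
        seq = torch.cat([self.patch_proj(patches), self.text_embed(token_ids)], dim=1)
        return self.lm_head(self.trunk(seq))

model = SingleTransformerVL()
patches = torch.randn(1, 16, 768)         # 16 image patch embeddings
tokens = torch.randint(0, 1000, (1, 8))   # 8 text token ids
logits = model(patches, tokens)
print(logits.shape)  # torch.Size([1, 24, 1000])
```

The point of the design is that cross-modal interaction happens in every layer, rather than only after a frozen vision encoder.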

## Highlights
This repository contains the PyTorch implementation, model weights, and training code for **Haplo**.

Basic usage example:

```python
import torch
from haplo import HaploProcessor, HaploForConditionalGeneration

processor = HaploProcessor.from_pretrained('stevengrove/Haplo-7B-Pro')
model = HaploForConditionalGeneration.from_pretrained(
    'stevengrove/Haplo-7B-Pro',
    torch_dtype=torch.bfloat16
).to('cuda')

# ... construct multimodal `inputs` with the processor ...
outputs = model.generate(inputs)
print(processor.decode(outputs[0]))
```
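As a rough sizing note (a back-of-envelope estimate, not an official requirement from this card): loading 7B parameters in bfloat16 takes about 13 GiB of GPU memory for the weights alone, before activations and the KV cache.

```python
# Back-of-envelope GPU memory for the weights of a 7B-parameter model in bfloat16.
# Illustrative only; actual runtime usage adds activations and the KV cache.
params = 7e9          # ~7 billion parameters
bytes_per_param = 2   # bfloat16 stores each parameter in 16 bits
weights_gib = params * bytes_per_param / 1024**3
print(f"{weights_gib:.1f} GiB")  # 13.0 GiB
```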

### Gradio Demo
Launch an interactive demo:
```bash
python demo/demo.py \
    -m "stevengrove/Haplo-7B-Pro-Video" \
    --server-port 8080 \
    --device cuda \
    --dtype bfloat16
```

**Multi-Modal Capabilities**

| Category                   | Example         |
|----------------------------|-----------------|
| Single Image Understanding | (example image) |
| Multi-Image Understanding  | (example image) |
| Video Understanding        | (example image) |

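For video understanding, multimodal models typically reduce a clip to a fixed number of sampled frames before encoding. A small frame-sampling helper (a hypothetical illustration, not part of the Haplo API) might look like:

```python
def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick `num_samples` uniformly spaced frame indices from a video.

    Hypothetical helper for illustration; each index is taken from the
    middle of its segment so samples cover the whole clip evenly.
    """
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

print(sample_frame_indices(100, 8))  # [6, 18, 31, 43, 56, 68, 81, 93]
```

The sampled frames would then be passed to the processor like any other batch of images.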
## Acknowledgement

```bibtex
@article{HaploVL,
  title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
  author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2503.14694},
  year={2025}
}
```