Improve model card
This PR improves the model card by adding the `any-to-any` pipeline tag to the metadata. This ensures the model is correctly categorized and discoverable within the multimodal section of the Hugging Face Hub.
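As a rough illustration of what the metadata change amounts to, the sketch below reads the front matter block of a model card and checks for the `pipeline_tag` key. The helper name `read_front_matter` and the sample card text are assumptions made for this example. The parser only handles flat `key: value` pairs and simple `- item` lists; it is not the Hub's own parser, and a real card should be read with a full YAML library.

```python
def read_front_matter(text: str) -> dict:
    """Parse the front matter at the top of a model card.

    Simplified on purpose: supports only flat 'key: value' pairs and
    '- item' lists that follow a key with an empty value.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block at all

    meta = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta  # closing delimiter reached
        if not stripped:
            continue
        if stripped.startswith("- ") and isinstance(meta.get(current_key), list):
            # List item belonging to the most recent key with an empty value
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            value = value.strip()
            meta[current_key] = value if value else []
    return {}  # unterminated front matter is treated as absent


# Hypothetical card text mirroring the metadata block after this PR
card = """---
language:
- en
license: apache-2.0
pipeline_tag: any-to-any
---

# ELLSA: End-to-end Listen, Look, Speak and Act
"""

metadata = read_front_matter(card)
print(metadata.get("pipeline_tag"))
```

A check like this could run before opening a PR to confirm the tag is present and spelled as intended.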
README.md
CHANGED
@@ -1,14 +1,16 @@
 ---
-license: apache-2.0
 language:
 - en
+license: apache-2.0
+pipeline_tag: any-to-any
 ---
+
 # ELLSA: End-to-end Listen, Look, Speak and Act
 
 <div align="center">
 
 <div>
-<a href="https://
+<a href="https://huggingface.co/papers/2510.16756" target="_blank">
 <img src="https://img.shields.io/badge/Paper-arXiv-red.svg" alt="Paper arXiv">
 </a>
 <a href="https://github.com/bytedance/SALMONN/tree/ELLSA" target="_blank">
@@ -27,6 +29,8 @@ language:
 
 The **first** end-to-end model that unifies **vision, speech, text and action** in a **streaming full-duplex** framework, enabling joint multimodal perception and concurrent generation.
 
+ELLSA (End-to-end Listen, Look, Speak and Act) is a full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture.
+
 <p align="center">
 <img src="docs/imgs/ellsa.png" width="60%" height="60%">
 </p>
@@ -158,7 +162,7 @@ If you find this project useful, please consider citing our work:
 ```bibtex
 @inproceedings{wang2026end,
 title={End-to-end Listen, Look, Speak and Act},
-author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun and Lu, Lu and Zhang, Chao},
+author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun Bureau and Lu, Lu and Zhang, Chao},
 journal={Proc. ICLR},
 year={2026},
 address={Rio de Janeiro}