Commit cfc523f (verified) · committed by nielsr (HF Staff) · 1 parent: e819518

Improve model card


This PR improves the model card by adding the `any-to-any` pipeline tag to the metadata. This ensures the model is correctly categorized and discoverable within the multimodal section of the Hugging Face Hub.
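As an illustration of the metadata change described above, the following is a minimal sketch of adding a key to a model card's YAML front matter with plain string handling. The helper name `set_front_matter_key` is hypothetical and is not part of any Hugging Face tooling; it assumes the front matter is delimited by `---` lines and only handles flat `key: value` entries.

```python
def set_front_matter_key(readme: str, key: str, value: str) -> str:
    """Return `readme` with `key: value` set in its YAML front matter.

    Hypothetical helper for illustration: handles only flat `key: value`
    entries and leaves nested YAML (lists, mappings) untouched.
    """
    lines = readme.split("\n")
    if not lines or lines[0].strip() != "---":
        # No front matter yet: create a minimal block at the top.
        return f"---\n{key}: {value}\n---\n" + readme

    # Locate the closing delimiter of the front matter block.
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        raise ValueError("front matter is missing its closing '---' line")

    for i in range(1, end):
        if lines[i].startswith(f"{key}:"):
            lines[i] = f"{key}: {value}"  # overwrite an existing entry
            break
    else:
        lines.insert(end, f"{key}: {value}")  # add just before the closing ---
    return "\n".join(lines)


# Example card text, shortened from the README in this commit.
card = "---\nlanguage:\n- en\nlicense: apache-2.0\n---\n# ELLSA\n"
updated = set_front_matter_key(card, "pipeline_tag", "any-to-any")
print(updated)
```

Applying the helper a second time with the same arguments leaves the text unchanged, because an existing `pipeline_tag:` entry is overwritten in place rather than duplicated.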

Files changed (1)
  1. README.md +7 -3
README.md CHANGED
@@ -1,14 +1,16 @@
 ---
-license: apache-2.0
 language:
 - en
+license: apache-2.0
+pipeline_tag: any-to-any
 ---
+
 # ELLSA: End-to-end Listen, Look, Speak and Act
 
 <div align="center">
 
 <div>
-<a href="https://arxiv.org/pdf/2510.16756" target="_blank">
+<a href="https://huggingface.co/papers/2510.16756" target="_blank">
 <img src="https://img.shields.io/badge/Paper-arXiv-red.svg" alt="Paper arXiv">
 </a>
 <a href="https://github.com/bytedance/SALMONN/tree/ELLSA" target="_blank">
@@ -27,6 +29,8 @@ language:
 
 The **first** end-to-end model that unifies **vision, speech, text and action** in a **streaming full-duplex** framework, enabling joint multimodal perception and concurrent generation.
 
+ELLSA (End-to-end Listen, Look, Speak and Act) is a full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture.
+
 <p align="center">
 <img src="docs/imgs/ellsa.png" width="60%" height="60%">
 </p>
@@ -158,7 +162,7 @@ If you find this project useful, please consider citing our work:
 ```bibtex
 @inproceedings{wang2026end,
 title={End-to-end Listen, Look, Speak and Act},
-author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun and Lu, Lu and Zhang, Chao},
+author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun Bureau and Lu, Lu and Zhang, Chao},
 journal={Proc. ICLR},
 year={2026},
 address={Rio de Janeiro}