Improve model card

This PR improves the model card by adding the `any-to-any` pipeline tag to the metadata, so the model is categorized correctly and is discoverable in the multimodal section of the Hugging Face Hub. It also points the paper badge at the Hugging Face paper page and adds a one-sentence summary of ELLSA.
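
For anyone applying the same metadata change to another model card, the edit can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for this PR: the helper name `set_front_matter_field` is hypothetical, it uses only the standard library, and it assumes the field is a simple top-level `key: value` line inside a front matter block delimited by `---`.

```python
def set_front_matter_field(readme_text: str, key: str, value: str) -> str:
    """Return readme_text with `key: value` set in its YAML front matter.

    Hypothetical helper: handles only flat, top-level `key: value` lines.
    """
    lines = readme_text.splitlines()
    new_line = f"{key}: {value}"

    # No front matter yet: create a block containing just the new field.
    if not lines or lines[0].strip() != "---":
        return "\n".join(["---", new_line, "---"] + lines) + "\n"

    # Locate the closing `---` of the existing front matter.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("front matter is opened but never closed")

    body = lines[1:end]
    for i, line in enumerate(body):
        if line.startswith(f"{key}:"):
            body[i] = new_line  # overwrite an existing value
            break
    else:
        body.append(new_line)  # field not present: append it

    return "\n".join(["---"] + body + ["---"] + lines[end + 1:]) + "\n"


if __name__ == "__main__":
    readme = "---\nlanguage:\n- en\nlicense: apache-2.0\n---\n# ELLSA\n"
    print(set_front_matter_field(readme, "pipeline_tag", "any-to-any"))
```

Running the helper twice leaves the file unchanged, because an existing `pipeline_tag` line is overwritten rather than duplicated.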

README.md +7 -3

@@ -1,14 +1,16 @@
 ---
-license: apache-2.0
 language:
 - en
+license: apache-2.0
+pipeline_tag: any-to-any
 ---
 # ELLSA: End-to-end Listen, Look, Speak and Act
 <div align="center">
 <div>
-    <a href="https://arxiv.org/pdf/2510.16756" target="_blank">
+    <a href="https://huggingface.co/papers/2510.16756" target="_blank">
       <img src="https://img.shields.io/badge/Paper-arXiv-red.svg" alt="Paper arXiv">
     </a>
     <a href="https://github.com/bytedance/SALMONN/tree/ELLSA" target="_blank">
@@ -27,6 +29,8 @@ language:
 The **first** end-to-end model that unifies **vision, speech, text and action** in a **streaming full-duplex** framework, enabling joint multimodal perception and concurrent generation.
+ELLSA (End-to-end Listen, Look, Speak and Act) is a full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture.
 <p align="center">
     <img src="docs/imgs/ellsa.png" width="60%" height="60%">
 </p>
@@ -158,7 +162,7 @@ If you find this project useful, please consider citing our work:
 ```bibtex
 @inproceedings{wang2026end,
   title={End-to-end Listen, Look, Speak and Act},
-  author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun and Lu, Lu and Zhang, Chao},
+  author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun Bureau and Lu, Lu and Zhang, Chao},
   journal={Proc. ICLR},
   year={2026},
   address={Rio de Janeiro}