Improve model card: Add pipeline tag, library name, project page link, and sample usage
This PR improves the model card by:
- Adding the `pipeline_tag: audio-to-audio` to the metadata, enabling easier discovery on the Hugging Face Hub at https://huggingface.co/models?pipeline_tag=audio-to-audio.
- Including `library_name: transformers` in the metadata, as evidence from `config.json` and the model's architecture suggests compatibility with the Hugging Face `transformers` library, which will enable the automated "how to use" widget.
- Adding an explicit link to the project page (`https://freedomintelligence.github.io/EchoX/`) in the top section of the model card.
- Replacing the generic "Usage" section with a comprehensive "Sample Usage" section, including environment setup, model download, and inference commands directly from the official GitHub repository's README, allowing users to quickly get started.
- Correcting the training data size in the "Model Description" from 10k hours to 6k hours, aligning with the paper's abstract and the GitHub README.
These changes provide more comprehensive information and improve user experience on the Hugging Face Hub.
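
Once merged, the new metadata can be spot-checked from Python. Below is a minimal sketch using `huggingface_hub.model_info` (illustrative only, not part of the PR; the repo id `FreedomIntelligence/EchoX-8B` is taken from the download commands in the diff below):

```python
from huggingface_hub import model_info

# Query the Hub for the repo's current metadata.
info = model_info("FreedomIntelligence/EchoX-8B")
print(info.pipeline_tag)   # expected after merge: "audio-to-audio"
print(info.library_name)   # expected after merge: "transformers"
print(info.tags)           # should include "audio", "chat", etc.
```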
````diff
@@ -1,19 +1,22 @@
 ---
+datasets:
+- custom
 language:
 - en
-tags:
-- audio-text-to-audio-text
-- speech-understanding
-- audio
-- chat
 license: apache-2.0
-datasets:
-- custom
 metrics:
 - wer
 - bleu
 - AIR-Bench
+pipeline_tag: audio-to-audio
+tags:
+- audio-text-to-audio-text
+- speech-understanding
+- audio
+- chat
+library_name: transformers
 ---
+
 <div align="center">
 <h1>
 EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
@@ -21,25 +24,50 @@ metrics:
 </div>
 
 <p align="center">
-<font size="3"><a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a> | <a href="https://arxiv.org/abs/2509.09174">📃 Paper</a> | <a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space</a> </font>
+<font size="3"><a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a> | <a href="https://arxiv.org/abs/2509.09174">📃 Paper</a> | <a href="https://freedomintelligence.github.io/EchoX">🌐 Project Page</a> | <a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space</a> </font>
 </p>
 
 ## Model Description
-EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 10k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.
+EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.
 
 ### Key Features
 <div>
 <ul>
 <font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
 <font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
-<font size="3"><li>Trained on Only 10k Hours of Curated Data, Ensuring Efficiency</li></font>
+<font size="3"><li>Trained on Only 6k Hours of Curated Data, Ensuring Efficiency</li></font>
 <font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
 <font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
 </ul>
 </div>
 
-## Usage
-
+## Sample Usage
+To set up your environment and run inference, follow these steps from the [GitHub repository](https://github.com/FreedomIntelligence/EchoX):
+
+First, clone the repository, set up the environment, and install dependencies:
+```bash
+git clone https://github.com/FreedomIntelligence/EchoX.git
+cd EchoX
+conda create -n echox python=3.10 pip=24.0
+conda activate echox
+pip install -r requirements.txt
+```
+
+Next, download the models:
+```bash
+pip install -U huggingface_hub
+hf download --resume-download FreedomIntelligence/EchoX-8B --local-dir EchoX-8B
+hf download --resume-download openai/whisper-large-v3 --local-dir whisper-large-v3
+```
+
+Finally, run inference on a test case, or start the Gradio web interface:
+```bash
+python demo.py
+# Alternatively, start the Gradio web interface:
+# python app.py
+# To use a specific GPU:
+# CUDA_VISIBLE_DEVICES=1 python app.py
+```
 
 # <span>📖 Citation</span>
 ```
````
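
For reference, the `hf download` step in the new "Sample Usage" section also has a Python equivalent. Below is a minimal sketch using `huggingface_hub.snapshot_download` (repo ids and target directories taken from the CLI commands above; partial downloads resume on rerun, mirroring `--resume-download`):

```python
from huggingface_hub import snapshot_download

# Fetch both checkpoints into local directories, matching the CLI layout.
snapshot_download("FreedomIntelligence/EchoX-8B", local_dir="EchoX-8B")
snapshot_download("openai/whisper-large-v3", local_dir="whisper-large-v3")
```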