Improve model card: Add pipeline tag, library, project page link, and sample usage
This PR enhances the model card by:
- Adding `pipeline_tag: audio-to-audio` to ensure discoverability for models processing audio input to audio output. You can find models using this pipeline tag at https://huggingface.co/models?pipeline_tag=audio-to-audio.
- Specifying `library_name: transformers`, which indicates compatibility with the Hugging Face Transformers library, enabling the automated display of a "how to use" widget. The `config.json` file confirms `transformers_version: 4.48.1` and `architectures: ["ACLlamaForCausalLM"]`.
- Including a direct link to the project page: https://freedomintelligence.github.io/EchoX/.
- Adding a concrete Python sample usage code snippet from the official GitHub repository to demonstrate how to run inference.
- Correcting the training data duration in the "Model Description" from 10k hours to 6k hours for consistency with other sections and the paper's details.
Please review and merge if these improvements align with the repository's goals.
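The metadata changes listed above can be sanity-checked locally before merging. The following is a minimal sketch, assuming the model card is a Markdown file whose YAML front matter is delimited by `---` lines. `read_front_matter`, `REQUIRED`, and the sample `CARD` text are hypothetical names introduced for illustration, and the parser only handles the flat `key: value` and `- item` shapes used in this card; a real check would more likely rely on a YAML library.

```python
# Hypothetical helper: parse flat YAML front matter (scalars and simple lists only).
def read_front_matter(text):
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block at the top of the file
    meta, key = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the metadata block
            break
        if not stripped:
            continue
        if stripped.startswith("- ") and isinstance(meta.get(key), list):
            # List item belonging to the most recent key with an empty value
            meta[key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            # An empty value means a list follows on the next lines
            meta[key] = value if value else []
    return meta


# Keys this PR adds; treated here as the ones to verify (an assumption).
REQUIRED = ("pipeline_tag", "library_name")

# Sample text mirroring the front matter proposed in this PR.
CARD = """---
datasets:
- custom
language:
- en
license: apache-2.0
pipeline_tag: audio-to-audio
library_name: transformers
---
# EchoX
"""

meta = read_front_matter(CARD)
missing = [k for k in REQUIRED if k not in meta]
print("missing keys:", missing)
```

In practice, `CARD` would be replaced by the contents of the repository's `README.md`.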
|
@@ -1,19 +1,22 @@
|
|
| 1 |
---
|
| 2 |
language:
|
| 3 |
- en
|
| 4 |
-
tags:
|
| 5 |
-
- audio-text-to-audio-text
|
| 6 |
-
- speech-understanding
|
| 7 |
-
- audio
|
| 8 |
-
- chat
|
| 9 |
license: apache-2.0
|
| 10 |
-
datasets:
|
| 11 |
-
- custom
|
| 12 |
metrics:
|
| 13 |
- wer
|
| 14 |
- bleu
|
| 15 |
- AIR-Bench
|
| 16 |
---
|
| 17 |
<div align="center">
|
| 18 |
<h1>
|
| 19 |
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
|
|
@@ -21,25 +24,68 @@ metrics:
|
|
| 21 |
</div>
|
| 22 |
|
| 23 |
<p align="center">
|
| 24 |
-
<font size="3"
|
| 25 |
</p>
|
| 26 |
|
| 27 |
## Model Description
|
| 28 |
-
EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only
|
| 29 |
|
| 30 |
### Key Features
|
| 31 |
<div>
|
| 32 |
<ul>
|
| 33 |
<font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
|
| 34 |
<font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
|
| 35 |
-
<font size="3"><li>Trained on Only
|
| 36 |
<font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
|
| 37 |
<font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
|
| 38 |
</ul>
|
| 39 |
</div>
|
| 40 |
|
| 41 |
## Usage
|
| 42 |
-
|
| 43 |
|
| 44 |
# <span>📖 Citation</span>
|
| 45 |
```
|
| 1 |
---
|
| 2 |
+
datasets:
|
| 3 |
+
- custom
|
| 4 |
language:
|
| 5 |
- en
|
| 6 |
license: apache-2.0
|
| 7 |
metrics:
|
| 8 |
- wer
|
| 9 |
- bleu
|
| 10 |
- AIR-Bench
|
| 11 |
+
pipeline_tag: audio-to-audio
|
| 12 |
+
tags:
|
| 13 |
+
- audio-text-to-audio-text
|
| 14 |
+
- speech-understanding
|
| 15 |
+
- audio
|
| 16 |
+
- chat
|
| 17 |
+
library_name: transformers
|
| 18 |
---
|
| 19 |
+
|
| 20 |
<div align="center">
|
| 21 |
<h1>
|
| 22 |
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
|
|
|
|
| 24 |
</div>
|
| 25 |
|
| 26 |
<p align="center">
|
| 27 |
+
<font size="3">
|
| 28 |
+
<a href="https://github.com/FreedomIntelligence/EchoX">🐈⬛ Github</a> | 
|
| 29 |
+
<a href="https://arxiv.org/abs/2509.09174">📃 Paper</a> | 
|
| 30 |
+
<a href="https://freedomintelligence.github.io/EchoX/">🌐 Project Page</a> | 
|
| 31 |
+
<a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a> 
|
| 32 |
+
</font>
|
| 33 |
</p>
|
| 34 |
|
| 35 |
## Model Description
|
| 36 |
+
EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.
|
| 37 |
|
| 38 |
### Key Features
|
| 39 |
<div>
|
| 40 |
<ul>
|
| 41 |
<font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
|
| 42 |
<font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
|
| 43 |
+
<font size="3"><li>Trained on Only 6k Hours of Curated Data, Ensuring Efficiency</li></font>
|
| 44 |
<font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
|
| 45 |
<font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
|
| 46 |
</ul>
|
| 47 |
</div>
|
| 48 |
|
| 49 |
## Usage
|
| 50 |
+
|
| 51 |
+
The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).
|
| 52 |
+
|
| 53 |
+
### Simple Inference
|
| 54 |
+
```python
|
| 55 |
+
from echox.inference_solver import FlexARInferenceSolver
|
| 56 |
+
from echox.utils import load_audio
|
| 57 |
+
|
| 58 |
+
# ******************** Speech-to-Speech Generation ********************
|
| 59 |
+
inference_solver = FlexARInferenceSolver(
|
| 60 |
+
    model_path="FreedomIntelligence/EchoX-8B", # or FreedomIntelligence/EchoX-3B
|
| 61 |
+
    precision="bf16",
|
| 62 |
+
    target_size=768,
|
| 63 |
+
)
|
| 64 |
+
|
| 65 |
+
# Load your audio file
|
| 66 |
+
audio_file = "path/to/your/audio.wav"
|
| 67 |
+
audio_tensor = load_audio(audio_file)
|
| 68 |
+
|
| 69 |
+
# Prepare prompt
|
| 70 |
+
q1 = f"Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"
|
| 71 |
+
|
| 72 |
+
# Perform inference
|
| 73 |
+
generated = inference_solver.generate(
|
| 74 |
+
    audios=[audio_tensor],
|
| 75 |
+
    qas=[[q1, None]],
|
| 76 |
+
    max_gen_len=8192,
|
| 77 |
+
    temperature=0.7,
|
| 78 |
+
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000), # optional
|
| 79 |
+
)
|
| 80 |
+
|
| 81 |
+
a1, new_audio = generated[0], generated[1][0]
|
| 82 |
+
print(f"Generated text: {a1}")
|
| 83 |
+
# Save the generated audio (if any)
|
| 84 |
+
if new_audio is not None:
|
| 85 |
+
    # `new_audio` is a torch.Tensor, save it to a .wav file
|
| 86 |
+
    # For example: torchaudio.save("output.wav", new_audio.cpu(), 16000)
|
| 87 |
+
    pass
|
| 88 |
+
```
|
| 89 |
|
| 90 |
# <span>📖 Citation</span>
|
| 91 |
```
|