---
datasets:
- custom
language:
- en
license: apache-2.0
metrics:
- wer
- bleu
- AIR-Bench
pipeline_tag: audio-to-audio
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
library_name: transformers
---

<div align="center">
<h1>
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
</h1>
</div>

<p align="center">
<font size="3">
<a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a> |
<a href="https://arxiv.org/abs/2509.09174">📃 Paper</a> |
<a href="https://freedomintelligence.github.io/EchoX/">🌐 Project Page</a> |
<a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a>
</font>
</p>

## Model Description

EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap; this repository hosts the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data yet delivers state-of-the-art results on knowledge-based question answering and speech-interaction tasks.

### Key Features

<div>
<ul>
<font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
<font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
<font size="3"><li>Trained on Only 6k Hours of Curated Data, Ensuring Efficiency</li></font>
<font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
<font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
</ul>
</div>
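
The three-stage pipeline named above (S2T, T2C, Echo) can be sketched at a purely conceptual level. Everything below is an illustrative toy, not the actual EchoX implementation: the function names, token formats, and codebook size are all hypothetical stand-ins for the real speech-to-text, text-to-codec, and echo stages.

```python
# Illustrative toy sketch of the three-stage Echo Training data flow.
# All names and data formats here are hypothetical.

def s2t(speech_tokens):
    """Stage 1 (S2T): map speech input to text, learning semantics."""
    # Stand-in for speech-to-text decoding.
    return [t.upper() for t in speech_tokens]

def t2c(text_tokens, codebook_size=1024):
    """Stage 2 (T2C): map text to acoustic codec-token ids."""
    # Stand-in for a text-to-codec model predicting discrete codec ids.
    return [hash(t) % codebook_size for t in text_tokens]

def echo(speech_tokens):
    """Stage 3 (Echo): chain the two stages so the model's own semantic
    (text) output drives acoustic generation, coupling both objectives."""
    text = s2t(speech_tokens)
    codec_ids = t2c(text)
    return text, codec_ids

text, codec_ids = echo(["hel", "lo"])
```

The only point of the sketch is the data flow: in the echo stage, acoustic targets are tied to the model's own semantic output rather than being trained in isolation.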

## Usage

The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).

### Simple Inference

```python
import torchaudio

from echox.inference_solver import FlexARInferenceSolver
from echox.utils import load_audio

# ******************** Speech-to-Speech Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="FreedomIntelligence/EchoX-3B",  # or "FreedomIntelligence/EchoX-8B"
    precision="bf16",
    target_size=768,
)

# Load your audio file
audio_file = "path/to/your/audio.wav"
audio_tensor = load_audio(audio_file)

# Prepare the prompt
q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"

# Perform inference
generated = inference_solver.generate(
    audios=[audio_tensor],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=0.7,
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
)

a1, new_audio = generated[0], generated[1][0]
print(f"Generated text: {a1}")

# Save the generated audio (if any)
if new_audio is not None:
    # `new_audio` is a torch.Tensor; save it to a 16 kHz .wav file
    torchaudio.save("output.wav", new_audio.cpu(), 16000)
```
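
The save step above relies on `torchaudio`. As a dependency-free alternative, a mono 16-bit WAV can be written with the standard-library `wave` module. The 16 kHz rate mirrors the comment in the snippet above; verify it against the repository before relying on it.

```python
import struct
import wave

def save_wav_16k_mono(samples, path):
    """Write float samples in [-1.0, 1.0] to a 16-bit PCM mono WAV at 16 kHz.
    The 16 kHz rate is an assumption; check it against the EchoX repo."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit PCM
        f.setframerate(16000)  # assumed output sample rate
        frames = bytearray()
        for s in samples:
            s = max(-1.0, min(1.0, float(s)))  # clip to valid range
            frames += struct.pack("<h", int(s * 32767))
        f.writeframes(bytes(frames))

# Example with dummy samples; for EchoX output you could pass
# new_audio.cpu().flatten().tolist() instead.
save_wav_16k_mono([0.0, 0.5, -0.5], "output_demo.wav")
```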

# <span>📖 Citation</span>

```
@misc{zhang2025echoxmitigatingacousticsemanticgap,
  title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
  author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
  year={2025},
  eprint={2509.09174},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.09174},
}
```