---
datasets:
- custom
language:
- en
license: apache-2.0
metrics:
- wer
- bleu
- AIR-Bench
pipeline_tag: audio-to-audio
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
library_name: transformers
---
# EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

🐈⬛ [Github](https://github.com/FreedomIntelligence/EchoX) |
📃 [Paper](https://arxiv.org/abs/2509.09174) |
🌐 Project Page |
🚀 Space (8B)
## Model Description
EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap; this repository hosts the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.
### Key Features
- Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs
- Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)
- Trained on Only 6k Hours of Curated Data, Ensuring Efficiency
- Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks
- Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks
## Usage
The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).
### Simple Inference
```python
import torchaudio

from echox.inference_solver import FlexARInferenceSolver
from echox.utils import load_audio

# ******************** Speech-to-Speech Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="FreedomIntelligence/EchoX-3B",  # or FreedomIntelligence/EchoX-8B
    precision="bf16",
    target_size=768,
)

# Load your audio file
audio_file = "path/to/your/audio.wav"
audio_tensor = load_audio(audio_file)

# Prepare the prompt; <|audio|> marks where the audio input is inserted
q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"

# Perform inference
generated = inference_solver.generate(
    audios=[audio_tensor],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=0.7,
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
)

a1, new_audio = generated[0], generated[1][0]
print(f"Generated text: {a1}")

# Save the generated audio (if any); `new_audio` is a torch.Tensor at 16 kHz
if new_audio is not None:
    torchaudio.save("output.wav", new_audio.cpu(), 16000)
```
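The example above passes a WAV path to `load_audio`. The input sample rate EchoX expects is not stated on this card, though the generated audio is saved at 16 kHz, so a 16 kHz mono input is a reasonable assumption. If your recording is stereo or uses another rate, the sketch below converts it first. It is a minimal stdlib-only illustration assuming PCM16 WAV input; `to_16k_mono` is a hypothetical helper, not part of the `echox` package, and a real pipeline would use torchaudio or librosa for higher-quality resampling.

```python
import struct
import wave


def to_16k_mono(in_path: str, out_path: str, target_rate: int = 16000) -> None:
    """Convert a PCM16 WAV file to 16 kHz mono via linear interpolation.

    Hypothetical preprocessing helper; EchoX's expected input format is an
    assumption here -- check the GitHub repo for the authoritative pipeline.
    """
    with wave.open(in_path, "rb") as wf:
        n_channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)

    # Downmix interleaved channels to mono by averaging
    mono = [
        sum(samples[i:i + n_channels]) // n_channels
        for i in range(0, len(samples), n_channels)
    ]

    # Linear-interpolation resample to the target rate
    out_len = int(len(mono) * target_rate / rate)
    resampled = []
    for j in range(out_len):
        pos = j * rate / target_rate
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, len(mono) - 1)]
        resampled.append(int(mono[i] * (1 - frac) + nxt * frac))

    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(target_rate)
        wf.writeframes(struct.pack("<%dh" % len(resampled), *resampled))
```

The converted file can then be handed to `load_audio` exactly as in the snippet above.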
## 📖 Citation
```bibtex
@misc{zhang2025echoxmitigatingacousticsemanticgap,
      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
      year={2025},
      eprint={2509.09174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09174},
}
```