---
datasets:
- custom
language:
- en
license: apache-2.0
metrics:
- wer
- bleu
- AIR-Bench
pipeline_tag: audio-to-audio
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
library_name: transformers
---
# EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
GitHub | Paper | Project Page | Space (8B)
## Model Description
EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap; this repository hosts the 3B version. By introducing Echo Training, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results on knowledge-based question answering and speech interaction tasks.
## Key Features
- Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs
- Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)
- Trained on Only 6k Hours of Curated Data, Ensuring Efficiency
- Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks
- Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks
## Usage
The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the GitHub repository.
### Simple Inference
```python
from echox.inference_solver import FlexARInferenceSolver
from echox.utils import load_audio

# ******************** Speech-to-Speech Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="FreedomIntelligence/EchoX-8B",  # or "FreedomIntelligence/EchoX-3B"
    precision="bf16",
    target_size=768,
)

# Load your audio file
audio_file = "path/to/your/audio.wav"
audio_tensor = load_audio(audio_file)

# Prepare the prompt; <|audio|> marks where the audio input is injected
q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"

# Perform inference
generated = inference_solver.generate(
    audios=[audio_tensor],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=0.7,
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
)

a1, new_audio = generated[0], generated[1][0]
print(f"Generated text: {a1}")

# Save the generated audio (if any)
if new_audio is not None:
    # `new_audio` is a torch.Tensor; save it to a .wav file, e.g.:
    # torchaudio.save("output.wav", new_audio.cpu(), 16000)
    pass
```
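If you prefer not to depend on `torchaudio` for saving, the returned waveform can be written with the standard-library `wave` module instead. The sketch below is an illustration, not part of the EchoX API: it assumes the output is a mono float waveform at 16 kHz with samples in [-1, 1] (e.g. `new_audio.cpu().flatten().tolist()` for a `torch.Tensor`).

```python
import struct
import wave


def save_wav(path, samples, sample_rate=16000):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 16-bit samples
        f.setframerate(sample_rate)
        # Clamp each sample, scale to int16 range, pack little-endian
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(pcm)


# Example: one second of silence at 16 kHz
save_wav("output.wav", [0.0] * 16000)
```

Because the file is plain 16-bit PCM, it plays in any audio tool and round-trips through `wave.open(path, "rb")` for verification.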
## Citation
```bibtex
@misc{zhang2025echoxmitigatingacousticsemanticgap,
    title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
    author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
    year={2025},
    eprint={2509.09174},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2509.09174},
}
```