---
datasets:
- custom
language:
- en
license: apache-2.0
metrics:
- wer
- bleu
- AIR-Bench
pipeline_tag: audio-to-audio
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
library_name: transformers
---
# EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

🐈⬛ [Github](https://github.com/FreedomIntelligence/EchoX) |
📃 [Paper](https://arxiv.org/abs/2509.09174) |
🌐 Project Page |
🚀 Space (8B)
## Model Description
EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap; this repository hosts the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.
### Key Features
- Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs
- Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)
- Trained on Only 6k Hours of Curated Data, Ensuring Efficiency
- Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks
- Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks
## Usage
The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).
### Simple Inference
```python
import torchaudio

from echox.inference_solver import FlexARInferenceSolver
from echox.utils import load_audio

# ******************** Speech-to-Speech Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="FreedomIntelligence/EchoX-3B",  # or FreedomIntelligence/EchoX-8B
    precision="bf16",
    target_size=768,
)

# Load your audio file
audio_file = "path/to/your/audio.wav"
audio_tensor = load_audio(audio_file)

# Prepare the prompt; <|audio|> marks where the audio input is inserted
q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"

# Perform inference
generated = inference_solver.generate(
    audios=[audio_tensor],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=0.7,
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
)

a1, new_audio = generated[0], generated[1][0]
print(f"Generated text: {a1}")

# Save the generated audio (if any); `new_audio` is a torch.Tensor at 16 kHz
if new_audio is not None:
    torchaudio.save("output.wav", new_audio.cpu(), 16000)
```
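The example above passes a WAV path to `load_audio`. The input sample rate EchoX expects is not stated on this card, though the generated audio is saved at 16 kHz, so a 16 kHz mono input is a reasonable assumption. If your recording is stereo or uses another rate, the sketch below converts it first. It is a minimal stdlib-only illustration assuming PCM16 WAV input; `to_16k_mono` is a hypothetical helper, not part of the `echox` package, and a real pipeline would use torchaudio or librosa for higher-quality resampling.

```python
import struct
import wave


def to_16k_mono(in_path: str, out_path: str, target_rate: int = 16000) -> None:
    """Convert a PCM16 WAV file to 16 kHz mono via linear interpolation.

    Hypothetical preprocessing helper; EchoX's expected input format is an
    assumption here -- check the GitHub repo for the authoritative pipeline.
    """
    with wave.open(in_path, "rb") as wf:
        n_channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)

    # Downmix interleaved channels to mono by averaging
    mono = [
        sum(samples[i:i + n_channels]) // n_channels
        for i in range(0, len(samples), n_channels)
    ]

    # Linear-interpolation resample to the target rate
    out_len = int(len(mono) * target_rate / rate)
    resampled = []
    for j in range(out_len):
        pos = j * rate / target_rate
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, len(mono) - 1)]
        resampled.append(int(mono[i] * (1 - frac) + nxt * frac))

    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(target_rate)
        wf.writeframes(struct.pack("<%dh" % len(resampled), *resampled))
```

The converted file can then be handed to `load_audio` exactly as in the snippet above.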
## 📖 Citation
```bibtex
@misc{zhang2025echoxmitigatingacousticsemanticgap,
      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
      year={2025},
      eprint={2509.09174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09174},
}
```