metadata
datasets:
  - custom
language:
  - en
license: apache-2.0
metrics:
  - wer
  - bleu
  - AIR-Bench
pipeline_tag: audio-to-audio
tags:
  - audio-text-to-audio-text
  - speech-understanding
  - audio
  - chat
library_name: transformers

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

🐈‍⬛ Github | 📃 Paper | 🌐 Project Page | 🚀 Space

Model Description

EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap. Its Echo Training scheme integrates semantic and acoustic learning, which mitigates the degradation of reasoning ability observed in existing speech-based LLMs. EchoX is trained on only 6k hours of data, yet it delivers state-of-the-art results on knowledge-based question answering and speech interaction tasks.

Key Features

  • Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs
  • Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)
  • Trained on Only 6k Hours of Curated Data, Ensuring Efficiency
  • Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks
  • Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks

Sample Usage

To set up your environment and run inference, follow these steps from the GitHub repository:

First, clone the repository, set up the environment, and install dependencies:

git clone https://github.com/FreedomIntelligence/EchoX.git
cd EchoX
conda create -n echox python=3.10 pip=24.0
conda activate echox
pip install -r requirements.txt

Next, download the models:

pip install -U huggingface_hub
hf download --resume-download FreedomIntelligence/EchoX-8B --local-dir EchoX-8B
hf download --resume-download openai/whisper-large-v3 --local-dir whisper-large-v3
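The same two downloads can also be scripted from Python. The following is a minimal sketch, not part of the EchoX repository: it assumes `huggingface_hub` is installed and uses its `snapshot_download` function, while the helper names `target_dirs` and `download_models` are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Repository IDs and local folder names taken from the commands above.
MODELS = {
    "FreedomIntelligence/EchoX-8B": "EchoX-8B",
    "openai/whisper-large-v3": "whisper-large-v3",
}


def target_dirs(root="."):
    """Map each repository ID to the local folder it should be stored in."""
    return {repo_id: Path(root) / folder for repo_id, folder in MODELS.items()}


def download_models(root="."):
    """Download both repositories under `root` and return their local paths."""
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    paths = {}
    for repo_id, target in target_dirs(root).items():
        # Fetches the full repository snapshot into the given folder.
        paths[repo_id] = snapshot_download(repo_id=repo_id, local_dir=str(target))
    return paths
```

Calling `download_models()` from the repository root should leave `EchoX-8B` and `whisper-large-v3` folders next to `demo.py`, matching the `--local-dir` values used above. Both repositories are large, so expect the first run to take a while.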

Finally, run inference on a test case, or start the Gradio web interface:

python demo.py
# Alternatively, start the Gradio web interface:
# python app.py
# To use a specific GPU:
# CUDA_VISIBLE_DEVICES=1 python app.py
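If the demo is started from another Python program rather than a shell, GPU selection can be passed through the child process environment. This is a hedged sketch only: `build_env` and `launch` are hypothetical helpers, not part of the repository, and the script names are assumed to be the `demo.py` and `app.py` mentioned above.

```python
import os
import subprocess
import sys


def build_env(gpu_index=None, base_env=None):
    """Return a copy of the environment, optionally restricted to one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    if gpu_index is not None:
        # Same effect as prefixing the command with CUDA_VISIBLE_DEVICES=<index>.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(script="app.py", gpu_index=None):
    """Run `script` with the current interpreter and wait for it to exit."""
    return subprocess.run(
        [sys.executable, script],
        env=build_env(gpu_index),
        check=True,  # raise CalledProcessError if the script exits with an error
    )
```

For example, `launch("app.py", gpu_index=1)` corresponds to the `CUDA_VISIBLE_DEVICES=1 python app.py` line above. The variable has to be set before the child process initializes CUDA, which is why it is placed in the environment rather than changed after startup.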

📖 Citation

@misc{zhang2025echoxmitigatingacousticsemanticgap,
      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs}, 
      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
      year={2025},
      eprint={2509.09174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09174}, 
}