---
datasets:
- custom
language:
- en
license: apache-2.0
metrics:
- wer
- bleu
- AIR-Bench
pipeline_tag: audio-to-audio
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
library_name: transformers
---
# EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
GitHub | Paper | Project Page | Space (8B)
## Model Description
EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap; this repository hosts the 3B version. By introducing Echo Training, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results on knowledge-based question answering and speech interaction tasks.
## Key Features
- Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs
- Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)
- Trained on Only 6k Hours of Curated Data, Ensuring Efficiency
- Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks
- Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks
## Usage
The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the GitHub repository.
### Simple Inference
```python
from echox.inference_solver import FlexARInferenceSolver
from echox.utils import load_audio

# ******************** Speech-to-Speech Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="FreedomIntelligence/EchoX-8B",  # or "FreedomIntelligence/EchoX-3B"
    precision="bf16",
    target_size=768,
)

# Load your audio file
audio_file = "path/to/your/audio.wav"
audio_tensor = load_audio(audio_file)

# Prepare the prompt; <|audio|> marks where the audio input is injected
q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"

# Perform inference
generated = inference_solver.generate(
    audios=[audio_tensor],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=0.7,
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
)

a1, new_audio = generated[0], generated[1][0]
print(f"Generated text: {a1}")

# Save the generated audio (if any)
if new_audio is not None:
    # `new_audio` is a torch.Tensor; save it to a .wav file, e.g.:
    # torchaudio.save("output.wav", new_audio.cpu(), 16000)
    pass
```
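If you prefer not to depend on `torchaudio` for saving, the returned waveform can be written with the standard-library `wave` module instead. The sketch below is an illustration, not part of the EchoX API: it assumes the output is a mono float waveform at 16 kHz with samples in [-1, 1] (e.g. `new_audio.cpu().flatten().tolist()` for a `torch.Tensor`).

```python
import struct
import wave


def save_wav(path, samples, sample_rate=16000):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 16-bit samples
        f.setframerate(sample_rate)
        # Clamp each sample, scale to int16 range, pack little-endian
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(pcm)


# Example: one second of silence at 16 kHz
save_wav("output.wav", [0.0] * 16000)
```

Because the file is plain 16-bit PCM, it plays in any audio tool and round-trips through `wave.open(path, "rb")` for verification.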
## Citation
```bibtex
@misc{zhang2025echoxmitigatingacousticsemanticgap,
    title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
    author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
    year={2025},
    eprint={2509.09174},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2509.09174},
}
```