---
datasets:
  - custom
language:
  - en
license: apache-2.0
metrics:
  - wer
  - bleu
  - AIR-Bench
pipeline_tag: audio-to-audio
tags:
  - audio-text-to-audio-text
  - speech-understanding
  - audio
  - chat
library_name: transformers
---

# EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

πŸˆβ€β¬› Github ο½œ  πŸ“ƒ Paper ο½œ  🌐 Project Page ο½œ  πŸš€ Space (8B) 

## Model Description

EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap; this repository hosts the 3B version. By introducing Echo Training, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results on knowledge-based question answering and speech-interaction tasks.

## Key Features

- Mitigates the acoustic-semantic gap in speech-to-speech LLMs
- Introduces Echo Training with a novel three-stage pipeline (S2T, T2C, Echo)
- Trained on only 6k hours of curated data, ensuring efficiency
- Achieves state-of-the-art performance on knowledge-based QA benchmarks
- Preserves reasoning and knowledge abilities for interactive speech tasks

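The three-stage pipeline named above can be sketched conceptually as follows. The stage names (S2T, T2C, Echo) come from the paper; the function and its behavior are illustrative placeholders, not the actual EchoX training API:

```python
# Conceptual sketch of the three-stage Echo Training pipeline.
# Only the stage names (S2T, T2C, Echo) come from the model card;
# the function below is a hypothetical illustration of their ordering.

def run_pipeline(model, data):
    """Apply the three training stages in order and return their names."""
    stages = [
        ("S2T", "speech-to-text: ground speech input in text understanding"),
        ("T2C", "text-to-codec: map text to acoustic codec tokens"),
        ("Echo", "echo training: joint semantic-acoustic objective"),
    ]
    completed = []
    for name, description in stages:
        # In real training, each stage would update `model` on `data` here.
        completed.append(name)
    return completed

print(run_pipeline(model=None, data=None))  # ['S2T', 'T2C', 'Echo']
```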
## Usage

The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the GitHub repository.

### Simple Inference

```python
import torchaudio

from echox.inference_solver import FlexARInferenceSolver
from echox.utils import load_audio

# ******************** Speech-to-Speech Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="FreedomIntelligence/EchoX-8B",  # or FreedomIntelligence/EchoX-3B
    precision="bf16",
    target_size=768,
)

# Load your audio file
audio_file = "path/to/your/audio.wav"
audio_tensor = load_audio(audio_file)

# Prepare the prompt; <|audio|> marks where the audio input is inserted
q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"

# Perform inference
generated = inference_solver.generate(
    audios=[audio_tensor],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=0.7,
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
)

a1, new_audio = generated[0], generated[1][0]
print(f"Generated text: {a1}")

# Save the generated audio (if any); `new_audio` is a torch.Tensor at 16 kHz
if new_audio is not None:
    torchaudio.save("output.wav", new_audio.cpu(), 16000)
```
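If torchaudio is not available, a generated waveform can also be written with Python's standard library. This is a minimal sketch, assuming a mono float waveform with values in [-1, 1] at 16 kHz (the `samples` argument and the sample rate are assumptions, not part of the EchoX API):

```python
import struct
import wave

def save_wav(path, samples, sample_rate=16000):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM WAV."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        # Clamp each float sample and convert it to a signed 16-bit integer.
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(pcm)

# Example: write 0.1 s of silence
save_wav("output.wav", [0.0] * 1600)
```

A tensor from the model would first be moved to the CPU and flattened to a plain list of floats (e.g. `new_audio.cpu().flatten().tolist()`) before being passed to `save_wav`.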

## 📖 Citation

```bibtex
@misc{zhang2025echoxmitigatingacousticsemanticgap,
      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
      year={2025},
      eprint={2509.09174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09174},
}
```