---
datasets:
- custom
language:
- en
license: apache-2.0
metrics:
- wer
- bleu
- AIR-Bench
pipeline_tag: audio-to-audio
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
library_name: transformers
---

<div align="center">
<h1>
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
</h1>
</div>

<p align="center">
<font size="3">
<a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a> |
<a href="https://arxiv.org/abs/2509.09174">📃 Paper</a> |
<a href="https://freedomintelligence.github.io/EchoX/">🌐 Project Page</a> |
<a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a>
</font>
</p>

## Model Description

EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap; this repository hosts the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data yet delivers state-of-the-art results on knowledge-based question answering and speech-interaction tasks.

### Key Features

<div>
<ul>
<font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
<font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
<font size="3"><li>Trained on Only 6k Hours of Curated Data, Ensuring Efficiency</li></font>
<font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
<font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
</ul>
</div>
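
The three-stage pipeline named above (S2T, T2C, Echo) can be sketched at a purely conceptual level. Everything below is an illustrative toy, not the actual EchoX implementation: the function names, token formats, and codebook size are all hypothetical stand-ins for the real speech-to-text, text-to-codec, and echo stages.

```python
# Illustrative toy sketch of the three-stage Echo Training data flow.
# All names and data formats here are hypothetical.

def s2t(speech_tokens):
    """Stage 1 (S2T): map speech input to text, learning semantics."""
    # Stand-in for speech-to-text decoding.
    return [t.upper() for t in speech_tokens]

def t2c(text_tokens, codebook_size=1024):
    """Stage 2 (T2C): map text to acoustic codec-token ids."""
    # Stand-in for a text-to-codec model predicting discrete codec ids.
    return [hash(t) % codebook_size for t in text_tokens]

def echo(speech_tokens):
    """Stage 3 (Echo): chain the two stages so the model's own semantic
    (text) output drives acoustic generation, coupling both objectives."""
    text = s2t(speech_tokens)
    codec_ids = t2c(text)
    return text, codec_ids

text, codec_ids = echo(["hel", "lo"])
```

The only point of the sketch is the data flow: in the echo stage, acoustic targets are tied to the model's own semantic output rather than being trained in isolation.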

## Usage

The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).

### Simple Inference

```python
import torchaudio

from echox.inference_solver import FlexARInferenceSolver
from echox.utils import load_audio

# ******************** Speech-to-Speech Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="FreedomIntelligence/EchoX-3B",  # or "FreedomIntelligence/EchoX-8B"
    precision="bf16",
    target_size=768,
)

# Load your audio file
audio_file = "path/to/your/audio.wav"
audio_tensor = load_audio(audio_file)

# Prepare the prompt
q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"

# Perform inference
generated = inference_solver.generate(
    audios=[audio_tensor],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=0.7,
    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
)

a1, new_audio = generated[0], generated[1][0]
print(f"Generated text: {a1}")

# Save the generated audio (if any)
if new_audio is not None:
    # `new_audio` is a torch.Tensor; save it to a 16 kHz .wav file
    torchaudio.save("output.wav", new_audio.cpu(), 16000)
```
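
The save step above relies on `torchaudio`. As a dependency-free alternative, a mono 16-bit WAV can be written with the standard-library `wave` module. The 16 kHz rate mirrors the comment in the snippet above; verify it against the repository before relying on it.

```python
import struct
import wave

def save_wav_16k_mono(samples, path):
    """Write float samples in [-1.0, 1.0] to a 16-bit PCM mono WAV at 16 kHz.
    The 16 kHz rate is an assumption; check it against the EchoX repo."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit PCM
        f.setframerate(16000)  # assumed output sample rate
        frames = bytearray()
        for s in samples:
            s = max(-1.0, min(1.0, float(s)))  # clip to valid range
            frames += struct.pack("<h", int(s * 32767))
        f.writeframes(bytes(frames))

# Example with dummy samples; for EchoX output you could pass
# new_audio.cpu().flatten().tolist() instead.
save_wav_16k_mono([0.0, 0.5, -0.5], "output_demo.wav")
```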

# <span>📖 Citation</span>

```
@misc{zhang2025echoxmitigatingacousticsemanticgap,
  title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
  author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
  year={2025},
  eprint={2509.09174},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.09174},
}
```