Improve model card: Add pipeline tag, library, project page link, and sample usage
#1
opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,19 +1,22 @@
 ---
+datasets:
+- custom
 language:
 - en
-tags:
-- audio-text-to-audio-text
-- speech-understanding
-- audio
-- chat
 license: apache-2.0
-datasets:
-- custom
 metrics:
 - wer
 - bleu
 - AIR-Bench
+pipeline_tag: audio-to-audio
+tags:
+- audio-text-to-audio-text
+- speech-understanding
+- audio
+- chat
+library_name: transformers
 ---
+
 <div align="center">
 <h1>
 EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
@@ -21,25 +24,68 @@ metrics:
 </div>
 
 <p align="center">
-<font size="3"
+<font size="3">
+<a href="https://github.com/FreedomIntelligence/EchoX">🐈⬛ Github</a> | 
+<a href="https://arxiv.org/abs/2509.09174">📃 Paper</a> | 
+<a href="https://freedomintelligence.github.io/EchoX/">🌐 Project Page</a> | 
+<a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a> 
+</font>
 </p>
 
 ## Model Description
-EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only
+EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.
 
 ### Key Features
 <div>
 <ul>
 <font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
 <font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
-<font size="3"><li>Trained on Only
+<font size="3"><li>Trained on Only 6k Hours of Curated Data, Ensuring Efficiency</li></font>
 <font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
 <font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
 </ul>
 </div>
 
 ## Usage
-
+
+The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).
+
+### Simple Inference
+```python
+from echox.inference_solver import FlexARInferenceSolver
+from echox.utils import load_audio
+
+# ******************** Speech-to-Speech Generation ********************
+inference_solver = FlexARInferenceSolver(
+    model_path="FreedomIntelligence/EchoX-8B",  # or FreedomIntelligence/EchoX-3B
+    precision="bf16",
+    target_size=768,
+)
+
+# Load your audio file
+audio_file = "path/to/your/audio.wav"
+audio_tensor = load_audio(audio_file)
+
+# Prepare prompt
+q1 = f"Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"
+
+# Perform inference
+generated = inference_solver.generate(
+    audios=[audio_tensor],
+    qas=[[q1, None]],
+    max_gen_len=8192,
+    temperature=0.7,
+    # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
+)
+
+a1, new_audio = generated[0], generated[1][0]
+print(f"Generated text: {a1}")
+# Save the generated audio (if any)
+if new_audio is not None:
+    # `new_audio` is a torch.Tensor, save it to a .wav file
+    # For example: torchaudio.save("output.wav", new_audio.cpu(), 16000)
+    pass
+```
 
 # <span>📖 Citation</span>
 ```
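If you want the sample usage above to actually write the generated speech to disk rather than ending in `pass`, the final branch could be completed along these lines. This is a minimal sketch, not part of the proposed card: it assumes `new_audio` is a mono 16 kHz waveform tensor (as the in-code comment suggests) and that `torchaudio` is installed.

```python
import torchaudio

# Minimal sketch: persist the speech returned by inference_solver.generate().
# Assumes `new_audio` is a 1-D (or [1, num_samples]) float tensor at 16 kHz;
# adjust the sample rate or shape handling if the actual output differs.
if new_audio is not None:
    waveform = new_audio.detach().cpu()
    if waveform.dim() == 1:
        waveform = waveform.unsqueeze(0)  # torchaudio.save expects [channels, num_samples]
    torchaudio.save("output.wav", waveform, sample_rate=16000)
    print("Saved generated speech to output.wav")
```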