Improve model card: Add pipeline tag, library, project page link, and sample usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +57 -11
README.md CHANGED
@@ -1,19 +1,22 @@
  ---
  language:
  - en
- tags:
- - audio-text-to-audio-text
- - speech-understanding
- - audio
- - chat
  license: apache-2.0
- datasets:
- - custom
  metrics:
  - wer
  - bleu
  - AIR-Bench
  ---
  <div align="center">
  <h1>
  EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
@@ -21,25 +24,68 @@ metrics:
  </div>

  <p align="center">
- <font size="3"><a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a>&nbsp|&nbsp<a href="https://arxiv.org/abs/2509.09174">📃 Paper</a>&nbsp|&nbsp<a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a>&nbsp</font>
  </p>

  ## Model Description
- EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 10k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.

  ### Key Features
  <div>
  <ul>
  <font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
  <font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
- <font size="3"><li>Trained on Only 10k Hours of Curated Data, Ensuring Efficiency</li></font>
  <font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
  <font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
  </ul>
  </div>

  ## Usage
- Load the EchoX model and run inference with your audio files as shown in the <a href="https://github.com/FreedomIntelligence/EchoX">GitHub repository</a>.

  # <span>📖 Citation</span>
  ```
 
  ---
+ datasets:
+ - custom
  language:
  - en
  license: apache-2.0
  metrics:
  - wer
  - bleu
  - AIR-Bench
+ pipeline_tag: audio-to-audio
+ tags:
+ - audio-text-to-audio-text
+ - speech-understanding
+ - audio
+ - chat
+ library_name: transformers
  ---
+
  <div align="center">
  <h1>
  EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
 
  </div>

  <p align="center">
+ <font size="3">
+ <a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a>&nbsp|&nbsp
+ <a href="https://arxiv.org/abs/2509.09174">📃 Paper</a>&nbsp|&nbsp
+ <a href="https://freedomintelligence.github.io/EchoX/">🌐 Project Page</a>&nbsp|&nbsp
+ <a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a>&nbsp
+ </font>
  </p>

  ## Model Description
+ EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.

  ### Key Features
  <div>
  <ul>
  <font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
  <font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
+ <font size="3"><li>Trained on Only 6k Hours of Curated Data, Ensuring Efficiency</li></font>
  <font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
  <font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
  </ul>
  </div>

  ## Usage
+
+ The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).
+
+ ### Simple Inference
+ ```python
55
+ from echox.inference_solver import FlexARInferenceSolver
56
+ from echox.utils import load_audio
57
+
58
+ # ******************** Speech-to-Speech Generation ********************
59
+ inference_solver = FlexARInferenceSolver(
60
+ model_path="FreedomIntelligence/EchoX-8B", # or FreedomIntelligence/EchoX-3B
61
+ precision="bf16",
62
+ target_size=768,
63
+ )
64
+
65
+ # Load your audio file
66
+ audio_file = "path/to/your/audio.wav"
67
+ audio_tensor = load_audio(audio_file)
68
+
69
+ # Prepare prompt
70
+ q1 = f"Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"
71
+
72
+ # Perform inference
73
+ generated = inference_solver.generate(
74
+ audios=[audio_tensor],
75
+ qas=[[q1, None]],
76
+ max_gen_len=8192,
77
+ temperature=0.7,
78
+ # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000), # optional
79
+ )
80
+
81
+ a1, new_audio = generated[0], generated[1][0]
82
+ print(f"Generated text: {a1}")
83
+ # Save the generated audio (if any)
84
+ if new_audio is not None:
85
+ # `new_audio` is a torch.Tensor, save it to a .wav file
86
+ # For example: torchaudio.save("output.wav", new_audio.cpu(), 16000)
87
+ pass
88
+ ```
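
The snippet above leaves saving the generated audio as a `pass` placeholder. As a minimal sketch of that step using only the Python standard library, the helper below writes mono float samples to a 16-bit PCM WAV file. The function name `write_wav` is hypothetical and not part of EchoX. The 16 kHz default mirrors the example comment in the snippet and is an assumption about the model's output rate, as is the idea that the tensor holds mono float samples in [-1.0, 1.0].

```python
import struct
import wave


def write_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper, not part of EchoX; sample_rate=16000 is an assumption.
    Returns the number of frames written.
    """
    frames = bytearray()
    for sample in samples:
        # Clip so out-of-range values cannot overflow the 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        # Little-endian signed 16-bit integer, the usual PCM WAV layout.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2
```

If `new_audio` is a one-dimensional (or single-channel) float tensor, something like `write_wav("output.wav", new_audio.squeeze().float().cpu().tolist())` should work. Check the repository for the actual output shape and sample rate before relying on it.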

  # <span>📖 Citation</span>
  ```