nielsr (HF Staff) committed
Commit 449e0da · verified · 1 Parent(s): 703f3e2

Improve model card: Add pipeline tag, library, project page link, and sample usage

This PR enhances the model card by:

- Adding `pipeline_tag: audio-to-audio`, since the model maps audio input to audio output; this makes it discoverable at https://huggingface.co/models?pipeline_tag=audio-to-audio.
- Specifying `library_name: transformers`, which indicates compatibility with the Hugging Face Transformers library, enabling the automated display of a "how to use" widget. The `config.json` file confirms `transformers_version: 4.48.1` and `architectures: ["ACLlamaForCausalLM"]`.
- Including a direct link to the project page: https://freedomintelligence.github.io/EchoX/.
- Adding a concrete Python sample usage code snippet from the official GitHub repository to demonstrate how to run inference.
- Correcting the training data duration in the "Model Description" from 10k hours to 6k hours for consistency with other sections and the paper's details.

Please review and merge if these improvements align with the repository's goals.
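For quick reference, the metadata block these changes produce can be checked mechanically. Below is a minimal sketch (assuming PyYAML is available; the front-matter text is copied verbatim from the diff) that parses the resulting front matter and verifies the two new fields:

```python
import yaml  # PyYAML; assumed installed

# Model-card front matter as it stands after this PR (verbatim from the diff).
FRONT_MATTER = """\
datasets:
- custom
language:
- en
license: apache-2.0
metrics:
- wer
- bleu
- AIR-Bench
pipeline_tag: audio-to-audio
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
library_name: transformers
"""

meta = yaml.safe_load(FRONT_MATTER)
assert meta["pipeline_tag"] == "audio-to-audio"
assert meta["library_name"] == "transformers"
print("front matter OK:", sorted(meta))
```

Keeping this block valid YAML is what lets the Hub read `pipeline_tag` and `library_name` from the card.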

Files changed (1)
  1. README.md +57 -11
README.md CHANGED
@@ -1,19 +1,22 @@
  ---
  language:
  - en
- tags:
- - audio-text-to-audio-text
- - speech-understanding
- - audio
- - chat
  license: apache-2.0
- datasets:
- - custom
  metrics:
  - wer
  - bleu
  - AIR-Bench
  ---
 
  <div align="center">
  <h1>
  EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
@@ -21,25 +24,68 @@ metrics:
  </div>

  <p align="center">
- <font size="3"><a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a>&nbsp|&nbsp<a href="https://arxiv.org/abs/2509.09174">📃 Paper</a>&nbsp|&nbsp<a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a>&nbsp</font>
  </p>

  ## Model Description
- EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 10k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.

  ### Key Features
  <div>
  <ul>
  <font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
  <font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
- <font size="3"><li>Trained on Only 10k Hours of Curated Data, Ensuring Efficiency</li></font>
  <font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
  <font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
  </ul>
  </div>

  ## Usage
- Load the EchoX model and run inference with your audio files as shown in the <a href="https://github.com/FreedomIntelligence/EchoX">GitHub repository</a>.

  # <span>📖 Citation</span>
  ```
 
  ---
+ datasets:
+ - custom
  language:
  - en
  license: apache-2.0
  metrics:
  - wer
  - bleu
  - AIR-Bench
+ pipeline_tag: audio-to-audio
+ tags:
+ - audio-text-to-audio-text
+ - speech-understanding
+ - audio
+ - chat
+ library_name: transformers
  ---
+
  <div align="center">
  <h1>
  EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
  </div>

  <p align="center">
+ <font size="3">
+ <a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a>&nbsp|&nbsp
+ <a href="https://arxiv.org/abs/2509.09174">📃 Paper</a>&nbsp|&nbsp
+ <a href="https://freedomintelligence.github.io/EchoX/">🌐 Project Page</a>&nbsp|&nbsp
+ <a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a>&nbsp
+ </font>
  </p>

  ## Model Description
+ EchoX is a Speech-to-Speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 6k hours of data while delivering state-of-the-art results in knowledge-based question answering and speech interaction tasks.

  ### Key Features
  <div>
  <ul>
  <font size="3"><li>Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs</li></font>
  <font size="3"><li>Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)</li></font>
+ <font size="3"><li>Trained on Only 6k Hours of Curated Data, Ensuring Efficiency</li></font>
  <font size="3"><li>Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks</li></font>
  <font size="3"><li>Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks</li></font>
  </ul>
  </div>

  ## Usage
+
+ The simplest code for EchoX inference is shown below. For more detailed instructions, including environment setup and model download, please refer to the [GitHub repository](https://github.com/FreedomIntelligence/EchoX).
+
+ ### Simple Inference
+ ```python
+ from echox.inference_solver import FlexARInferenceSolver
+ from echox.utils import load_audio
+
+ # ******************** Speech-to-Speech Generation ********************
+ inference_solver = FlexARInferenceSolver(
+     model_path="FreedomIntelligence/EchoX-8B",  # or FreedomIntelligence/EchoX-3B
+     precision="bf16",
+     target_size=768,
+ )
+
+ # Load your audio file
+ audio_file = "path/to/your/audio.wav"
+ audio_tensor = load_audio(audio_file)
+
+ # Prepare prompt
+ q1 = "Please read the audio you just received, then provide a detailed description and answer the question asked in the audio. <|audio|>"
+
+ # Perform inference
+ generated = inference_solver.generate(
+     audios=[audio_tensor],
+     qas=[[q1, None]],
+     max_gen_len=8192,
+     temperature=0.7,
+     # logits_processor=inference_solver.create_logits_processor(cfg=4.0, audio_top_k=2000),  # optional
+ )
+
+ a1, new_audio = generated[0], generated[1][0]
+ print(f"Generated text: {a1}")
+
+ # Save the generated audio (if any)
+ if new_audio is not None:
+     # `new_audio` is a torch.Tensor; save it to a .wav file, e.g.:
+     # torchaudio.save("output.wav", new_audio.cpu(), 16000)
+     pass
+ ```

  # <span>📖 Citation</span>
  ```