nielsr HF Staff committed on
Commit b6ee060 · verified · 1 Parent(s): 5130ff3

Update model card: add paper link, correct pipeline tag, and refine usage instructions


This PR improves the model card by:
- Linking the model to its corresponding paper: [The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels](https://huggingface.co/papers/2602.00604).
- Updating the `pipeline_tag` to `audio-text-to-text` to better represent the Large Audio Language Model (LALM) architecture.
- Adding the official GitHub repository link.
- Updating the code usage example to include the necessary BEATs checkpoint path, following the instructions found in the GitHub repository.
- Adding the test set performance results from the paper.

Files changed (1)
  1. README.md +33 -45
README.md CHANGED
@@ -1,98 +1,86 @@
  ---
  language:
  - en
  tags:
  - audio
  - audio-text-alignment
  - xacle
  - lalm
- library_name: pytorch
- pipeline_tag: audio-classification
- license: cc-by-nc-4.0
  ---

  # XACLE-TMU-2026

  **Large Audio Language Model for Audio-Text Alignment Score Prediction**

- This model was developed for the [XACLE Challenge](https://xacle.org/) by TMU.

- For detailed usage instructions, please refer to [GitHub](https://github.com/shiotalab-tmu/tmu-xacle2026).

  ## Model Description

- XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between audio and text captions. The model combines:

  - **BEATs** audio encoder (90M params, frozen)
  - **SwiGLU MLP** audio projection with gated residual (10M params)
  - **Qwen2.5-0.5B-Instruct** LLM backbone (494M params)
- - **MLP Score Head** for score prediction

- **Total: ~594M parameters**

  ## Performance

  | Split | SRCC |
  |-------|------|
  | Validation | **0.6746** |

  ## Usage

  ```python
  from tmu_xacle.model.xacle_model import XACLEModel

- # Load model
- model = XACLEModel.from_pretrained("Atotti/xacle-tmu-2026", device="cuda")

  # Predict alignment score
  score = model.predict("audio.wav", "A dog barking in the park")
- print(f"Alignment Score: {score:.2f}")  # Score in [0, 10]
  ```

  ## Architecture

- ```
- Audio Waveform (16kHz)
-     |
- BEATs Encoder (frozen)
-     [B, 500, 768]
-     |
- SwiGLU MLP + Gated Residual
-     [B, 100, 896]
-     |
- [TEXT] [AUDIO_START] [AUDIO] [AUDIO_END] [SCORE] [EOS]
-     |
- Qwen2.5-0.5B-Instruct
-     |
- [SCORE] Token Hidden State
-     [B, 896]
-     |
- MLP Score Head (896 -> 512 -> 128)
-     |
- Linear (128 -> 1)
-     |
- Alignment Score [-1, 1] -> [0, 10]
- ```

  ## Training

- The model was trained in 3 stages:
- 1. **Stage 1**: Audio Captioning Pretraining (skipped, using pretrained components)
- 2. **Stage 2**: CLAP Pseudo-Label Pretraining
- 3. **Stage 3**: XACLE Fine-tuning with ListNet loss
-
- Training details:
- - Optimizer: AdamW (lr=6.2e-6)
- - Loss: ListNet Top-1 Loss
- - SpecAugment: freqm=15, timem=30
- - Epochs: 50

  ## Citation

  ```bibtex
- wip
  ```

  ## License

- CC-BY-NC-4.0

  ---
  language:
  - en
+ library_name: pytorch
+ license: cc-by-nc-4.0
+ pipeline_tag: audio-text-to-text
  tags:
  - audio
  - audio-text-alignment
  - xacle
  - lalm
  ---

  # XACLE-TMU-2026

  **Large Audio Language Model for Audio-Text Alignment Score Prediction**

+ This model was developed for the [XACLE Challenge](https://xacle.org/) by the Tokyo Metropolitan University (TMU) Shiota Laboratory. It is described in the paper [The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels](https://huggingface.co/papers/2602.00604).

+ Official code repository: [GitHub](https://github.com/shiotalab-tmu/tmu-xacle2026)

  ## Model Description

+ XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between general audio and text captions. The architecture consists of:

  - **BEATs** audio encoder (90M params, frozen)
  - **SwiGLU MLP** audio projection with gated residual (10M params)
  - **Qwen2.5-0.5B-Instruct** LLM backbone (494M params)
+ - **MLP Score Head** for alignment score prediction

+ **Total parameters: ~594M**

  ## Performance

+ The system secured third place in the XACLE Challenge team ranking.
+
  | Split | SRCC |
  |-------|------|
  | Validation | **0.6746** |
+ | Test (final ensemble) | **0.632** |
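For context on the metric reported above: SRCC is the Spearman rank correlation coefficient between predicted and human alignment scores, so it rewards getting the *ranking* of audio-caption pairs right rather than matching absolute values. A minimal, self-contained sketch with made-up scores (illustrative numbers, not challenge data):

```python
def spearman_srcc(a, b):
    """Spearman rank correlation for two tie-free score lists."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    # Classic formula for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical predicted vs. human scores for five audio-caption pairs
predicted = [7.9, 2.1, 5.5, 9.0, 4.2]
human = [8.5, 1.5, 3.0, 8.0, 6.0]
print(spearman_srcc(predicted, human))  # -> 0.8
```

Because SRCC depends only on ranks, any monotonic rescaling of the predictions (e.g. mapping raw scores to [0, 10]) leaves it unchanged.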
 
  ## Usage

+ To use this model, install the dependencies from the [official repository](https://github.com/shiotalab-tmu/tmu-xacle2026) and download the BEATs_iter3+ (AS2M) checkpoint as described in its README.
+
  ```python
  from tmu_xacle.model.xacle_model import XACLEModel

+ # Load the pretrained model from Hugging Face
+ model = XACLEModel.from_pretrained(
+     "Atotti/xacle-tmu-2026",
+     beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",  # path to the downloaded BEATs checkpoint
+     device="cuda",
+ )

  # Predict alignment score
+ # The returned score represents the semantic alignment between the audio and the caption
  score = model.predict("audio.wav", "A dog barking in the park")
+ print(f"Alignment Score: {score:.2f}")
  ```
 
  ## Architecture

+ The model processes 16 kHz audio waveforms through a frozen BEATs encoder. These features are projected into the LLM's embedding space by the SwiGLU MLP; the Qwen2.5 backbone then processes the combined text and audio tokens, and the hidden state of a dedicated `[SCORE]` token is passed to an MLP head that regresses the final alignment score.
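The data flow can be sketched with dummy arrays, using the shapes from the original architecture diagram ([B, 500, 768] → [B, 100, 896] → scalar score). The 5x temporal pooling, the plain linear maps, and the tanh activations below are simplifying assumptions; the real model uses a SwiGLU MLP with a gated residual and a trained score head:

```python
import numpy as np

B = 2  # batch size

# Stand-in for frozen BEATs encoder output: 500 frames of 768-dim features
beats_feats = np.random.randn(B, 500, 768)

# Audio projection: pool 5 consecutive frames, then map 768 -> 896 (the LLM hidden size)
pooled = beats_feats.reshape(B, 100, 5, 768).mean(axis=2)  # [B, 100, 768]
W_proj = np.random.randn(768, 896) / np.sqrt(768)
audio_tokens = pooled @ W_proj                             # [B, 100, 896]

# The LLM consumes [TEXT] [AUDIO_START] <audio tokens> [AUDIO_END] [SCORE] [EOS];
# after the backbone, the [SCORE] token's hidden state ([B, 896]) feeds the score head.
score_hidden = np.random.randn(B, 896)
W1 = np.random.randn(896, 512) / np.sqrt(896)
W2 = np.random.randn(512, 128) / np.sqrt(512)
W3 = np.random.randn(128, 1) / np.sqrt(128)
h = np.tanh(score_hidden @ W1)      # MLP score head: 896 -> 512 -> 128 -> 1
h = np.tanh(h @ W2)
raw = np.tanh(h @ W3).squeeze(-1)   # raw score in [-1, 1]
score = (raw + 1) * 5               # rescale [-1, 1] -> [0, 10]
print(audio_tokens.shape, score.shape)
```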
 
  ## Training

+ The model was trained using a three-stage pipeline:
+ 1. **Stage 1**: Automated audio captioning (AAC) pretraining.
+ 2. **Stage 2**: Pretraining with CLAP pseudo-labels (identified as the primary performance driver).
+ 3. **Stage 3**: Fine-tuning on the XACLE dataset using ListNet loss.
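The ListNet (top-1) objective used in Stage 3 can be illustrated as follows: each training list's true and predicted scores are turned into top-1 probability distributions with a softmax, and the loss is the cross-entropy between them, so predictions that preserve the true ordering score lower. A minimal NumPy sketch with illustrative numbers (the actual training code in the repository may differ):

```python
import numpy as np

def listnet_top1_loss(pred_scores, true_scores):
    """ListNet top-1 loss: cross-entropy between the softmax
    distributions induced by true and predicted scores over one list."""
    def softmax(x):
        e = np.exp(x - x.max())  # subtract max for numerical stability
        return e / e.sum()

    p_true = softmax(np.asarray(true_scores, dtype=float))
    p_pred = softmax(np.asarray(pred_scores, dtype=float))
    return float(-(p_true * np.log(p_pred)).sum())

# One list of four audio-caption pairs with ground-truth alignment scores
true_scores = [9.0, 3.0, 6.0, 1.0]
good_preds = [8.5, 2.0, 5.5, 0.5]  # preserves the true ordering -> low loss
bad_preds = [1.0, 8.0, 2.0, 9.0]   # roughly reversed ordering -> high loss

print(listnet_top1_loss(good_preds, true_scores) < listnet_top1_loss(bad_preds, true_scores))  # True
```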
 
  ## Citation

  ```bibtex
+ @article{shiota2026tmu,
+   title={The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels},
+   author={Shiota Lab TMU},
+   journal={arXiv preprint arXiv:2602.00604},
+   year={2026}
+ }
  ```

  ## License

+ This project is licensed under the CC-BY-NC-4.0 license.