Atotti committed on
Commit a2f9b5d · verified
1 Parent(s): 2c0fe29

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +101 -0
  2. config.json +10 -0
  3. model.pt +3 -0
README.md ADDED
@@ -0,0 +1,101 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - audio
+ - audio-text-alignment
+ - xacle
+ - lalm
+ library_name: pytorch
+ pipeline_tag: audio-classification
+ ---
+
+ # XACLE-TMU-2026
+
+ **Large Audio Language Model for Audio-Text Alignment Score Prediction**
+
+ This model was developed for the [XACLE Challenge](https://xacle-challenge.github.io/) by Tokyo Metropolitan University.
+
+ ## Model Description
+
+ XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between audio and text captions. The model combines:
+
+ - **BEATs** audio encoder (90M params, frozen)
+ - **SwiGLU MLP** audio projection with gated residual (10M params)
+ - **Qwen2.5-0.5B-Instruct** LLM backbone (494M params)
+ - **MLP Score Head** for score prediction
+
+ **Total: ~594M parameters**
+
+ ## Performance
+
+ | Split | SRCC |
+ |-------|------|
+ | Validation | **0.6746** |
+
+ ## Usage
+
+ ```python
+ from tmu_xacle.model.xacle_model import XACLEModel
+
+ # Load model
+ model = XACLEModel.from_pretrained("Atotti/xacle-tmu-2026", device="cuda")
+
+ # Predict alignment score
+ score = model.predict("audio.wav", "A dog barking in the park")
+ print(f"Alignment Score: {score:.2f}") # Score in [0, 10]
+ ```
+
+ ## Architecture
+
+ ```
+ Audio Waveform (16kHz)
+ |
+ BEATs Encoder (frozen)
+ [B, 500, 768]
+ |
+ SwiGLU MLP + Gated Residual
+ [B, 100, 896]
+ |
+ [TEXT] [AUDIO_START] [AUDIO] [AUDIO_END] [SCORE] [EOS]
+ |
+ Qwen2.5-0.5B-Instruct
+ |
+ [SCORE] Token Hidden State
+ [B, 896]
+ |
+ MLP Score Head (896 -> 512 -> 128)
+ |
+ Linear (128 -> 1)
+ |
+ Alignment Score [-1, 1] -> [0, 10]
+ ```
+
+ ## Training
+
+ The training recipe has 3 stages:
+ 1. **Stage 1**: Audio Captioning Pretraining (skipped; pretrained components were used instead)
+ 2. **Stage 2**: CLAP Pseudo-Label Pretraining
+ 3. **Stage 3**: XACLE Fine-tuning with ListNet loss
+
+ Training details:
+ - Optimizer: AdamW (lr=6.2e-6)
+ - Loss: ListNet Top-1 Loss
+ - SpecAugment: freqm=15, timem=30
+ - Epochs: 50
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{xacle2026tmu,
+ title={TMU System for XACLE Challenge 2026},
+ author={Tokyo Metropolitan University},
+ booktitle={XACLE Challenge Workshop},
+ year={2026}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0
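
The README above names ListNet Top-1 Loss for Stage 3 but does not show it. As a rough illustration, the standard top-1 formulation is the cross-entropy between the softmax of the target scores and the softmax of the predicted scores within one list. The sketch below uses only the standard library; the function names and the choice to treat one batch as one list are assumptions, not the authors' training code.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(values)
    log_total = math.log(sum(math.exp(v - peak) for v in values))
    return [v - peak - log_total for v in values]


def listnet_top1_loss(predicted, target):
    """Cross-entropy between top-1 probabilities of target and predicted scores.

    Both arguments are score lists for the same items (assumed: one batch
    is treated as one ranking list).
    """
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and equal length")
    target_probs = [math.exp(lp) for lp in log_softmax(target)]
    predicted_log_probs = log_softmax(predicted)
    return -sum(t * lp for t, lp in zip(target_probs, predicted_log_probs))


# Predictions that preserve the target ordering incur a lower loss
# than predictions that reverse it.
aligned = listnet_top1_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
reversed_order = listnet_top1_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(aligned < reversed_order)
```

Because the loss compares distributions over a list rather than individual values, it rewards correct relative ordering, which fits a rank-based metric such as SRCC.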
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "model_type": "xacle",
+   "llm_model_name": "Qwen/Qwen2.5-0.5B-Instruct",
+   "audio_encoder": "BEATs",
+   "audio_dim": 768,
+   "llm_hidden_dim": 896,
+   "num_audio_tokens": 100,
+   "intermediate_dim": 3584,
+   "val_srcc": 0.6746
+ }
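
The config values line up with the shapes in the README's architecture diagram (768-dimensional BEATs features projected to 896-dimensional LLM embeddings, 100 audio tokens). A small sketch of that cross-check follows; `summarize_config` is a hypothetical helper, and reading `intermediate_dim` as the hidden width of the SwiGLU projection is an assumption, since the commit does not say what it controls.

```python
import json

# Config text copied from the config.json added in this commit.
CONFIG_TEXT = """
{
  "model_type": "xacle",
  "llm_model_name": "Qwen/Qwen2.5-0.5B-Instruct",
  "audio_encoder": "BEATs",
  "audio_dim": 768,
  "llm_hidden_dim": 896,
  "num_audio_tokens": 100,
  "intermediate_dim": 3584,
  "val_srcc": 0.6746
}
"""


def summarize_config(text):
    """Parse the config and derive a few shape relationships (hypothetical helper)."""
    cfg = json.loads(text)
    return {
        # Input and output widths of the audio projection.
        "projection": (cfg["audio_dim"], cfg["llm_hidden_dim"]),
        # Ratio of intermediate_dim to the LLM width (assumed MLP expansion).
        "expansion_ratio": cfg["intermediate_dim"] / cfg["llm_hidden_dim"],
        "audio_tokens": cfg["num_audio_tokens"],
    }


summary = summarize_config(CONFIG_TEXT)
print(summary)
```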
model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e775d749d577761624d0a3096da449718c3f6c87accf2144a8fc096076fd3f05
+ size 2380172135
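
The `model.pt` entry is a Git LFS pointer, not the weights themselves: three `key value` lines giving the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below parses that pointer with a hypothetical `parse_lfs_pointer` helper and compares the size against a float32 estimate. The 4-bytes-per-parameter figure is an assumption, since the commit does not state the checkpoint's dtype or contents.

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:e775d749d577761624d0a3096da449718c3f6c87accf2144a8fc096076fd3f05
size 2380172135
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its fields (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER_TEXT)
size_gb = info["size_bytes"] / 1e9

# Assumption: ~594M parameters (from the README) stored as 4-byte floats.
estimated_fp32_gb = 594_000_000 * 4 / 1e9

print(f"file size: {size_gb:.2f} GB, float32 estimate: {estimated_fp32_gb:.2f} GB")
```

The two figures are close, which is at least consistent with a full float32 checkpoint of all components, though the pointer alone cannot confirm that.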