Upload folder using huggingface_hub

Files changed (3)

README.md +101 -0
config.json +10 -0
model.pt +3 -0

README.md ADDED

	@@ -0,0 +1,101 @@

+---
+license: apache-2.0
+language:
+- en
+tags:
+- audio
+- audio-text-alignment
+- xacle
+- lalm
+library_name: pytorch
+pipeline_tag: audio-classification
+---
+# XACLE-TMU-2026
+**Large Audio Language Model for Audio-Text Alignment Score Prediction**
+This model was developed for the [XACLE Challenge](https://xacle-challenge.github.io/) by Tokyo Metropolitan University.
+## Model Description
+XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between audio and text captions. The model combines:
+- **BEATs** audio encoder (90M params, frozen)
+- **SwiGLU MLP** audio projection with gated residual (10M params)
+- **Qwen2.5-0.5B-Instruct** LLM backbone (494M params)
+- **MLP Score Head** for score prediction
+**Total: ~594M parameters**
+## Performance
+| Split | SRCC |
+|-------|------|
+| Validation | **0.6746** |
+## Usage
+```python
+from tmu_xacle.model.xacle_model import XACLEModel
+# Load model
+model = XACLEModel.from_pretrained("Atotti/xacle-tmu-2026", device="cuda")
+# Predict alignment score
+score = model.predict("audio.wav", "A dog barking in the park")
+print(f"Alignment Score: {score:.2f}")  # Score in [0, 10]
+```
+## Architecture
+```
+Audio Waveform (16kHz)
+       |
+  BEATs Encoder (frozen)
+  [B, 500, 768]
+       |
+  SwiGLU MLP + Gated Residual
+  [B, 100, 896]
+       |
+  [TEXT] [AUDIO_START] [AUDIO] [AUDIO_END] [SCORE] [EOS]
+       |
+  Qwen2.5-0.5B-Instruct
+       |
+  [SCORE] Token Hidden State
+  [B, 896]
+       |
+  MLP Score Head (896 -> 512 -> 128)
+       |
+  Linear (128 -> 1)
+       |
+  Alignment Score [-1, 1] -> [0, 10]
+```
+## Training
+The training recipe defines 3 stages, of which Stage 1 was skipped:
+1. **Stage 1**: Audio Captioning Pretraining (skipped, using pretrained components)
+2. **Stage 2**: CLAP Pseudo-Label Pretraining
+3. **Stage 3**: XACLE Fine-tuning with ListNet loss
+Training details:
+- Optimizer: AdamW (lr=6.2e-6)
+- Loss: ListNet Top-1 Loss
+- SpecAugment: freqm=15, timem=30
+- Epochs: 50
+## Citation
+```bibtex
+@inproceedings{xacle2026tmu,
+  title={TMU System for XACLE Challenge 2026},
+  author={Tokyo Metropolitan University},
+  booktitle={XACLE Challenge Workshop},
+  year={2026}
+}
+```
+## License
+Apache 2.0
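
Reviewer note on the README above: the final score mapping (`[-1, 1] -> [0, 10]`) and the ListNet Top-1 loss can be sketched in plain Python as follows. This is a minimal illustration, not the repository's code. The function names `rescale_score` and `listnet_top1_loss` are hypothetical, and the linear form of the rescaling (with clipping) is an assumption, since the README only states the two ranges.

```python
import math


def rescale_score(raw):
    """Map a raw head output in [-1, 1] to [0, 10].

    Assumption: the mapping is linear and out-of-range values are clipped.
    """
    clipped = max(-1.0, min(1.0, raw))
    return (clipped + 1.0) * 5.0


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_total = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_total for v in values]


def listnet_top1_loss(predicted, target):
    """ListNet top-1 loss for one list of items.

    Cross-entropy between the top-1 probability distribution induced by the
    target scores and the one induced by the predicted scores.
    """
    target_probs = [math.exp(v) for v in log_softmax(target)]
    pred_log_probs = log_softmax(predicted)
    return -sum(t * lp for t, lp in zip(target_probs, pred_log_probs))


if __name__ == "__main__":
    print(rescale_score(0.2))
    print(listnet_top1_loss([0.1, 0.5, 0.9], [2.0, 5.0, 8.0]))
```

Because the loss compares softmax distributions over a whole list, it depends on the relative ordering and spacing of scores within a batch rather than on their absolute values, which fits a rank-based metric such as SRCC.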

config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "model_type": "xacle",
+  "llm_model_name": "Qwen/Qwen2.5-0.5B-Instruct",
+  "audio_encoder": "BEATs",
+  "audio_dim": 768,
+  "llm_hidden_dim": 896,
+  "num_audio_tokens": 100,
+  "intermediate_dim": 3584,
+  "val_srcc": 0.6746
+}
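
Reviewer note on `val_srcc`: SRCC is the Spearman rank correlation coefficient, that is, the Pearson correlation computed on ranks. The sketch below shows one way to compute it in plain Python, with ties given their average rank. The helper names are hypothetical, and this is not necessarily the evaluation script used to obtain 0.6746.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def srcc(predicted, target):
    """Spearman rank correlation between predicted and reference scores."""
    return pearson(average_ranks(predicted), average_ranks(target))


if __name__ == "__main__":
    print(srcc([0.2, 0.9, 0.4], [3.0, 8.5, 6.0]))
```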

model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e775d749d577761624d0a3096da449718c3f6c87accf2144a8fc096076fd3f05
+size 2380172135
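
Reviewer note on `model.pt`: the three lines above are a Git LFS pointer, not the weights themselves. The sketch below parses the pointer and derives the checkpoint size. The helper name `parse_lfs_pointer` is hypothetical. The parameter estimate assumes 4 bytes per parameter (float32) and ignores any non-weight data in the file, so it is only a rough consistency check against the README's "~594M parameters".

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:e775d749d577761624d0a3096da449718c3f6c87accf2144a8fc096076fd3f05
size 2380172135
"""

info = parse_lfs_pointer(POINTER)
size_gib = info["size_bytes"] / 2**30

# Assumption: weights are stored as float32 (4 bytes each).
approx_fp32_params = info["size_bytes"] // 4

if __name__ == "__main__":
    print(f"{size_gib:.2f} GiB, about {approx_fp32_params / 1e6:.0f}M float32 values")
```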