# XACLE-TMU-2026

**Large Audio Language Model for Audio-Text Alignment Score Prediction**
This model was developed for the XACLE Challenge by the Tokyo Metropolitan University (TMU) Shiota Laboratory. It is described in the paper *The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels*.

**Official Code Repository:** GitHub
## Model Description
XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between general audio and text captions. The architecture consists of:
- BEATs audio encoder (90M params, frozen)
- SwiGLU MLP audio projection with gated residual (10M params)
- Qwen2.5-0.5B-Instruct LLM backbone (494M params)
- MLP Score Head for alignment score prediction
Total Parameters: ~594M
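To make the projection component above concrete, here is a minimal sketch of a SwiGLU MLP with a gated residual connection. This is an illustrative reconstruction, not the released implementation: the layer names, the hidden width, and the use of a learned scalar gate are all assumptions (the BEATs feature dimension of 768 and the Qwen2.5-0.5B hidden size of 896 are taken from those models).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUProjector(nn.Module):
    """Hypothetical SwiGLU audio projector with a gated residual.

    Maps frozen BEATs features (dim 768) into the LLM embedding
    space (dim 896 for Qwen2.5-0.5B). All names are illustrative.
    """

    def __init__(self, audio_dim: int = 768, llm_dim: int = 896, hidden_dim: int = 2048):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, llm_dim)  # residual path into LLM space
        self.w_gate = nn.Linear(llm_dim, hidden_dim)  # SwiGLU gate branch
        self.w_up = nn.Linear(llm_dim, hidden_dim)    # SwiGLU value branch
        self.w_down = nn.Linear(hidden_dim, llm_dim)
        self.gate = nn.Parameter(torch.zeros(1))      # learned scalar residual gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.in_proj(x)                                        # (B, T, llm_dim)
        h = self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))     # SwiGLU MLP
        return x + torch.tanh(self.gate) * h                       # gated residual

proj = SwiGLUProjector()
feats = torch.randn(2, 100, 768)  # a batch of BEATs-like feature sequences
out = proj(feats)
print(out.shape)  # torch.Size([2, 100, 896])
```

With the gate initialized to zero, the module starts as a plain linear projection and learns how much of the SwiGLU branch to mix in, which tends to stabilize early training.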
## Performance
The system secured third place in the XACLE challenge team ranking.
| Split | SRCC |
|---|---|
| Validation | 0.6746 |
| Test (Final Ensemble) | 0.632 |
## Usage

To use this model, install the dependencies from the official repository and download the BEATs_iter3+ (AS2M) checkpoint as described in its README.
```python
from tmu_xacle.model.xacle_model import XACLEModel

# Load the pre-trained model from Hugging Face
model = XACLEModel.from_pretrained(
    "Atotti/xacle-tmu-2026",
    beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",  # path to the downloaded BEATs checkpoint
    device="cuda",
)

# Predict the semantic alignment score between an audio clip and a text caption
score = model.predict("audio.wav", "A dog barking in the park")
print(f"Alignment Score: {score:.2f}")
```
## Architecture

The model processes 16 kHz audio waveforms through a frozen BEATs encoder. These features are projected into the LLM's embedding space. The Qwen2.5 backbone processes the combined text and audio tokens, and the hidden state of a dedicated `[SCORE]` token is passed to an MLP head to regress the final alignment score.
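The last step above, reading out the `[SCORE]` token and regressing a scalar, can be sketched as follows. This is a minimal illustration under stated assumptions: the two-layer MLP shape, the GELU activation, and the `score_positions` interface are hypothetical, and the hidden size of 896 matches Qwen2.5-0.5B.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Hypothetical MLP score head: maps the LLM hidden state at the
    [SCORE] token position to a scalar alignment score."""

    def __init__(self, hidden_dim: int = 896):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor, score_positions: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H); score_positions: (B,) index of [SCORE] per sequence
        idx = score_positions.view(-1, 1, 1).expand(-1, 1, hidden_states.size(-1))
        score_hidden = hidden_states.gather(1, idx).squeeze(1)  # (B, H)
        return self.mlp(score_hidden).squeeze(-1)               # (B,) scalar scores

head = ScoreHead()
hidden = torch.randn(2, 10, 896)          # LLM hidden states for 2 sequences
positions = torch.tensor([4, 7])          # where [SCORE] sits in each sequence
scores = head(hidden, positions)
print(scores.shape)  # torch.Size([2])
```

Gathering a single token's hidden state (rather than pooling the whole sequence) lets the LLM itself decide what to accumulate into the `[SCORE]` position via attention.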
## Training
The model was trained using a three-stage pipeline:
- Stage 1: Automated audio captioning (AAC) pretraining.
- Stage 2: Pretraining with CLAP pseudo-labels (identified as the primary performance driver).
- Stage 3: Fine-tuning on the XACLE dataset using ListNet loss.
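The ListNet loss used in Stage 3 is a listwise ranking objective: it compares the softmax distribution induced by ground-truth scores with the one induced by predicted scores. A generic top-1 ListNet sketch (not the project's exact training code) looks like this:

```python
import torch
import torch.nn.functional as F

def listnet_loss(pred_scores: torch.Tensor, true_scores: torch.Tensor) -> torch.Tensor:
    """Top-1 ListNet loss over a list of items.

    Both inputs have shape (B, N): B lists of N candidate scores.
    The loss is the cross-entropy between the softmax of the true
    scores and the softmax of the predicted scores.
    """
    true_dist = F.softmax(true_scores, dim=-1)        # target ranking distribution
    log_pred = F.log_softmax(pred_scores, dim=-1)     # predicted log-distribution
    return -(true_dist * log_pred).sum(dim=-1).mean()

target = torch.tensor([[3.0, 1.0, 0.0]])   # ground-truth alignment scores
good = listnet_loss(target, target)        # predictions match the target ordering
bad = listnet_loss(torch.tensor([[0.0, 1.0, 3.0]]), target)  # reversed ordering
```

Because the loss depends only on the relative ordering within each list, it optimizes ranking quality directly, which suits an SRCC-based evaluation better than pointwise regression.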
## Citation

```bibtex
@article{tsutsumi2026tmu,
  title={The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels},
  author={Tsutsumi, Ayuto and Tanaka, Kohei and Shiota, Sayaka},
  journal={arXiv preprint arXiv:2602.00604},
  year={2026}
}
```
## License
This project is licensed under the CC-BY-NC-4.0 license.