XACLE-TMU-2026

Large Audio Language Model for Audio-Text Alignment Score Prediction

This model was developed for the XACLE Challenge by the Tokyo Metropolitan University (TMU) Shiota Laboratory. It is described in the paper "The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels".

Official Code Repository: GitHub

Model Description

XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between general audio and text captions. The architecture consists of:

  • BEATs audio encoder (90M params, frozen)
  • SwiGLU MLP audio projection with gated residual (10M params)
  • Qwen2.5-0.5B-Instruct LLM backbone (494M params)
  • MLP Score Head for alignment score prediction

Total Parameters: ~594M
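
The SwiGLU projection above bridges the frozen BEATs features into the LLM's embedding space. A minimal NumPy sketch of a SwiGLU block with a gated residual is shown below; all weight shapes, the scalar gate, and the assumption that input and output dimensions match are illustrative, not the model's actual configuration.

```python
import numpy as np

def swish(x):
    """Swish/SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_projection(x, w_gate, w_up, w_down, gate):
    """SwiGLU MLP block with a gated residual (illustrative shapes).

    x:      (seq_len, d_in)   audio features from the encoder
    w_gate: (d_in, d_hidden)  gate-branch weights
    w_up:   (d_in, d_hidden)  value-branch weights
    w_down: (d_hidden, d_out) down-projection into the LLM space
    gate:   scalar in [0, 1]  residual mixing coefficient (assumed form)
    """
    h = swish(x @ w_gate) * (x @ w_up)  # SwiGLU: gated elementwise product
    y = h @ w_down                      # project into the LLM embedding dim
    # Gated residual: blend projection with a skip connection.
    # (d_in == d_out is assumed here so the skip is shape-compatible.)
    return gate * y + (1.0 - gate) * x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate = rng.standard_normal((8, 16)) * 0.1
w_up = rng.standard_normal((8, 16)) * 0.1
w_down = rng.standard_normal((16, 8)) * 0.1
out = swiglu_projection(x, w_gate, w_up, w_down, gate=0.5)
print(out.shape)  # (4, 8)
```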

Performance

The system secured third place in the XACLE Challenge team ranking.

Split                   SRCC
Validation              0.6746
Test (Final Ensemble)   0.632
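
SRCC (Spearman rank correlation coefficient) measures how well the predicted score ordering matches the ground-truth ordering. It can be computed as the Pearson correlation of the two rank vectors; a small self-contained sketch (with average ranks for ties) is given below.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with ties assigned the average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):           # average ranks over tied groups
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def srcc(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    return float(np.corrcoef(ra, rb)[0, 1])

print(srcc([1, 2, 3, 4], [10, 20, 30, 40]))  # ≈ 1.0 (same ordering)
print(srcc([1, 2, 3, 4], [40, 30, 20, 10]))  # ≈ -1.0 (reversed ordering)
```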

Usage

To use this model, you need to install the dependencies from the official repository. You also need to download the BEATs_iter3+ (AS2M) checkpoint as described in the README.

from tmu_xacle.model.xacle_model import XACLEModel

# Load pre-trained model from Hugging Face
model = XACLEModel.from_pretrained(
    "Atotti/xacle-tmu-2026",
    beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt", # Path to downloaded BEATs checkpoint
    device="cuda",
)

# Predict alignment score
# The model predicts a score representing the semantic alignment
score = model.predict("audio.wav", "A dog barking in the park")
print(f"Alignment Score: {score:.2f}") 

Architecture

The model processes 16kHz audio waveforms through a frozen BEATs encoder. These features are projected into the LLM's embedding space. The Qwen2.5 backbone processes the combined text and audio tokens, and the hidden state of a specific [SCORE] token is passed to an MLP head to regress the final alignment score.
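
The final regression step can be sketched as follows: the hidden state at the [SCORE] token's position is fed through a small MLP head to produce a scalar. The hidden-layer size, ReLU activation, and two-layer structure below are assumptions for illustration only.

```python
import numpy as np

def regress_score(hidden_states, score_pos, w1, b1, w2, b2):
    """Regress an alignment score from the [SCORE] token's hidden state.

    hidden_states: (seq_len, d_model) LLM outputs over text+audio tokens
    score_pos:     index of the [SCORE] token in the sequence
    w1, b1, w2, b2: weights of a small 2-layer MLP head (illustrative)
    """
    h = hidden_states[score_pos]        # hidden state of the [SCORE] token
    h = np.maximum(h @ w1 + b1, 0.0)    # ReLU hidden layer (assumed)
    return float(h @ w2 + b2)           # scalar alignment score

rng = np.random.default_rng(0)
hs = rng.standard_normal((10, 8))       # toy sequence of hidden states
w1 = rng.standard_normal((8, 16)) * 0.1
b1 = np.zeros(16)
w2 = rng.standard_normal(16) * 0.1
b2 = 0.0
score = regress_score(hs, score_pos=9, w1=w1, b1=b1, w2=w2, b2=b2)
print(type(score).__name__)  # float
```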

Training

The model was trained using a three-stage pipeline:

  1. Stage 1: Automated audio captioning (AAC) pretraining.
  2. Stage 2: Pretraining with CLAP pseudo-labels (identified as the primary performance driver).
  3. Stage 3: Fine-tuning on the XACLE dataset using ListNet loss.
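
Stage 3 uses ListNet, a listwise ranking loss. In its standard top-one form it is the cross-entropy between the softmax distributions induced by the true and predicted scores over a list of items; a minimal NumPy sketch follows (the exact loss configuration used in training may differ).

```python
import numpy as np

def listnet_loss(pred_scores, true_scores):
    """ListNet top-one loss: cross-entropy between the softmax
    distributions induced by true and predicted scores over a list."""
    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()
    p_true = softmax(np.asarray(true_scores, dtype=float))
    p_pred = softmax(np.asarray(pred_scores, dtype=float))
    return float(-(p_true * np.log(p_pred + 1e-12)).sum())

# Predictions that preserve the true ordering incur a lower loss
# than predictions that invert it.
true = [3.0, 1.0, 2.0]
good = listnet_loss([3.0, 1.0, 2.0], true)
bad = listnet_loss([1.0, 3.0, 2.0], true)
print(good < bad)  # True
```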

Citation

@article{tsutsumi2026tmu,
  title={The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels},
  author={Tsutsumi, Ayuto and Tanaka, Kohei and Shiota, Sayaka},
  journal={arXiv preprint arXiv:2602.00604},
  year={2026}
}

License

This project is licensed under the CC-BY-NC-4.0 license.
