Update model card: add paper link, correct pipeline tag, and refine usage instructions

This PR improves the model card by:
- Linking the model to its corresponding paper: [The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels](https://huggingface.co/papers/2602.00604).
- Updating the `pipeline_tag` to `audio-text-to-text` to better represent the Large Audio Language Model (LALM) architecture.
- Adding the official GitHub repository link.
- Updating the code usage example to include the necessary BEATs checkpoint path, following the instructions found in the GitHub repository.
- Adding the test set performance results from the paper.
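
For reviewers checking the added numbers: SRCC in the performance table is the Spearman rank correlation coefficient between predicted and reference scores. The sketch below is a minimal pure-Python illustration of that metric, not the challenge's official scoring script. The function names `average_ranks` and `srcc` and the sample scores are made up for this example.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def srcc(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rank_x = average_ranks(predicted)
    rank_y = average_ranks(reference)
    n = len(rank_x)
    mean_x = sum(rank_x) / n
    mean_y = sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores for illustration only, not data from the paper.
    predicted = [7.1, 2.3, 5.0, 9.4, 4.2]
    reference = [6.0, 3.0, 5.5, 8.0, 2.0]
    print(f"SRCC: {srcc(predicted, reference):.3f}")
```

Because the metric depends only on rank order, any monotonic rescaling of the predicted scores leaves it unchanged.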

README.md +33 -45

@@ -1,98 +1,86 @@
 ---
 language:
 - en
 tags:
 - audio
 - audio-text-alignment
 - xacle
 - lalm
-library_name: pytorch
-pipeline_tag: audio-classification
-license: cc-by-nc-4.0
 ---
 # XACLE-TMU-2026
 **Large Audio Language Model for Audio-Text Alignment Score Prediction**
-This model was developed for the [XACLE Challenge](https://xacle.org/) by TMU.
-For detailed usage instructions, please refer to [GitHub](https://github.com/shiotalab-tmu/tmu-xacle2026).
 ## Model Description
-XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between audio and text captions. The model combines:
 - **BEATs** audio encoder (90M params, frozen)
 - **SwiGLU MLP** audio projection with gated residual (10M params)
 - **Qwen2.5-0.5B-Instruct** LLM backbone (494M params)
-- **MLP Score Head** for score prediction
-**Total: ~594M parameters**
 ## Performance
 | Split | SRCC |
 |-------|------|
 | Validation | **0.6746** |
 ## Usage
 ```python
 from tmu_xacle.model.xacle_model import XACLEModel
-# Load model
-model = XACLEModel.from_pretrained("Atotti/xacle-tmu-2026", device="cuda")
 # Predict alignment score
 score = model.predict("audio.wav", "A dog barking in the park")
-print(f"Alignment Score: {score:.2f}")  # Score in [0, 10]
 ```
 ## Architecture
-```
-Audio Waveform (16kHz)
-       |
-  BEATs Encoder (frozen)
-  [B, 500, 768]
-       |
-  SwiGLU MLP + Gated Residual
-  [B, 100, 896]
-       |
-  [TEXT] [AUDIO_START] [AUDIO] [AUDIO_END] [SCORE] [EOS]
-       |
-  Qwen2.5-0.5B-Instruct
-       |
-  [SCORE] Token Hidden State
-  [B, 896]
-       |
-  MLP Score Head (896 -> 512 -> 128)
-       |
-  Linear (128 -> 1)
-       |
-  Alignment Score [-1, 1] -> [0, 10]
-```
 ## Training
-The model was trained in 3 stages:
-1. **Stage 1**: Audio Captioning Pretraining (skipped, using pretrained components)
-2. **Stage 2**: CLAP Pseudo-Label Pretraining
-3. **Stage 3**: XACLE Fine-tuning with ListNet loss
-Training details:
-- Optimizer: AdamW (lr=6.2e-6)
-- Loss: ListNet Top-1 Loss
-- SpecAugment: freqm=15, timem=30
-- Epochs: 50
 ## Citation
 ```bibtex
-wip
 ```
 ## License
-CC-BY-NC-4.0

 ---
 language:
 - en
+library_name: pytorch
+license: cc-by-nc-4.0
+pipeline_tag: audio-text-to-text
 tags:
 - audio
 - audio-text-alignment
 - xacle
 - lalm
 ---
 # XACLE-TMU-2026
 **Large Audio Language Model for Audio-Text Alignment Score Prediction**
+This model was developed for the [XACLE Challenge](https://xacle.org/) by the Shiota Laboratory at Tokyo Metropolitan University (TMU). It is described in the paper [The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels](https://huggingface.co/papers/2602.00604).
+Official Code Repository: [GitHub](https://github.com/shiotalab-tmu/tmu-xacle2026)
 ## Model Description
+XACLE-TMU is a Large Audio Language Model (LALM) that predicts alignment scores between general audio and text captions. The architecture consists of:
 - **BEATs** audio encoder (90M params, frozen)
 - **SwiGLU MLP** audio projection with gated residual (10M params)
 - **Qwen2.5-0.5B-Instruct** LLM backbone (494M params)
+- **MLP Score Head** for alignment score prediction
+**Total Parameters: ~594M**
 ## Performance
+The system secured third place in the XACLE challenge team ranking.
 | Split | SRCC |
 |-------|------|
 | Validation | **0.6746** |
+| Test (Final Ensemble) | **0.632** |
 ## Usage
+To use this model, install the dependencies from the [official repository](https://github.com/shiotalab-tmu/tmu-xacle2026) and download the BEATs_iter3+ (AS2M) checkpoint as described in that repository's README.
 ```python
 from tmu_xacle.model.xacle_model import XACLEModel
+# Load pre-trained model from Hugging Face
+model = XACLEModel.from_pretrained(
+    "Atotti/xacle-tmu-2026",
+    beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",  # Path to the downloaded BEATs checkpoint
+    device="cuda",
+)
 # Predict alignment score
+# The returned score represents the semantic alignment between the audio and the caption
 score = model.predict("audio.wav", "A dog barking in the park")
+print(f"Alignment Score: {score:.2f}")
 ```
 ## Architecture
+The model passes 16 kHz audio waveforms through a frozen BEATs encoder, and a SwiGLU MLP with a gated residual projects the resulting features into the LLM's embedding space. The Qwen2.5 backbone processes the combined text and audio tokens, and the hidden state of the `[SCORE]` token is passed to an MLP head that regresses the final alignment score.
 ## Training
+The model was trained using a three-stage pipeline:
+1. **Stage 1**: Automated audio captioning (AAC) pretraining.
+2. **Stage 2**: Pretraining with CLAP pseudo-labels (identified as the primary performance driver).
+3. **Stage 3**: Fine-tuning on the XACLE dataset using ListNet loss.
 ## Citation
 ```bibtex
+@article{shiota2026tmu,
+  title={The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels},
+  author={Shiota Lab TMU},
+  journal={arXiv preprint arXiv:2602.00604},
+  year={2026}
+}
 ```
 ## License
+This project is licensed under the CC-BY-NC-4.0 license.
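
Note on Stage 3 in the diff above: the model card names ListNet loss without spelling it out. As a rough illustration, the standard ListNet top-1 loss is the cross-entropy between the softmax of the target scores and the softmax of the predicted scores over one list of items. The sketch below is pure Python and rests on assumptions: the helper names are hypothetical, and how the repository groups audio-caption pairs into lists or scales scores is not shown on this page.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one list of scores."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]


def listnet_top1_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Cross-entropy between top-1 probabilities of target and predicted scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    log_p_pred = log_softmax(predicted)
    # Top-1 probability of each item under the target scores.
    p_target = [math.exp(v) for v in log_softmax(target)]
    return -sum(t * lp for t, lp in zip(p_target, log_p_pred))


if __name__ == "__main__":
    # Hypothetical scores for one list of three audio-caption pairs.
    target = [8.0, 3.0, 5.0]
    well_ordered = [7.5, 2.0, 5.5]
    mis_ordered = [2.0, 7.5, 5.5]
    print(listnet_top1_loss(well_ordered, target))
    print(listnet_top1_loss(mis_ordered, target))
```

A prediction that orders the list like the targets yields a lower loss than one that inverts it, which makes this a listwise objective rather than a per-item regression.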