---
base_model: meta-llama/Llama-2-7b-chat-hf
library_name: peft
license: cc-by-nc-sa-4.0
tags:
- audio
- video
- segmentation
- mask-quality-assessment
- audio-visual-segmentation
- lora
---

# MQ-Auditor HyperLoRA Weights

This repository contains the released MQ-Auditor pretrained weights for reference-free mask quality assessment in language-referred audio-visual segmentation.

The checkpoint corresponds to:

```text
epochs96_lr1e-4_bs4_gradacc8_lora_r32alpha64_pos0.5_ioulosswei0
```
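
The run name appears to encode the training hyperparameters. The sketch below decodes it under that assumption; `parse_run_name` and the key names are hypothetical, and the meaning of the `pos` and `ioulosswei` fields is not documented here, so they are kept as raw numbers.

```python
import re

RUN_NAME = "epochs96_lr1e-4_bs4_gradacc8_lora_r32alpha64_pos0.5_ioulosswei0"

# Each pattern is an assumption about how the run name was assembled.
_PATTERNS = {
    "epochs": (r"epochs(\d+)", int),
    "learning_rate": (r"lr([0-9.]+e-?\d+|[0-9.]+)", float),
    "batch_size": (r"bs(\d+)", int),
    "grad_accum_steps": (r"gradacc(\d+)", int),
    "lora_r": (r"lora_r(\d+)", int),
    "lora_alpha": (r"alpha(\d+)", int),
    "pos": (r"pos([0-9.]+)", float),                 # meaning not documented
    "ioulosswei": (r"ioulosswei([0-9.]+)", float),   # presumably an IoU loss weight
}


def parse_run_name(name):
    """Return the hyperparameters that can be recognized in a run name."""
    parsed = {}
    for key, (pattern, cast) in _PATTERNS.items():
        match = re.search(pattern, name)
        if match:
            parsed[key] = cast(match.group(1))
    return parsed


config = parse_run_name(RUN_NAME)

# Derived quantities, assuming `bs` is a per-device batch size.
effective_batch = config["batch_size"] * config["grad_accum_steps"]  # 4 * 8 = 32
lora_scaling = config["lora_alpha"] / config["lora_r"]               # 64 / 32 = 2.0
```

If this reading is right, each optimizer step sees 32 samples per device, and the LoRA update is scaled by a factor of 2.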

## Model

MQ-Auditor takes a video clip, audio, a referring expression, a frame, and a candidate segmentation mask, then predicts mask quality attributes such as mask type, IoU, and recommended action.

The released weights are intended to be used with the MQ-Auditor codebase and MQ-RAVSBench dataset. The base LLM checkpoint and external encoders are not included in this package.
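
Because the base LLM is not bundled, it has to be obtained separately before the adapter can be attached. The sketch below shows one way to do this with PEFT, assuming the base model is pulled from the Hub under the `base_model` ID in the card metadata and the adapter from the `Jinxing1/MQ-Auditor` repository. The helper `load_lora_adapter` is hypothetical. It only attaches the LoRA weights to the language model; the external encoders and the non-LoRA trainable weights still require the MQ-Auditor codebase, and whether the adapter matches a plain Llama-2 model without that codebase is not confirmed here.

```python
BASE_MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # from the card metadata; access is gated upstream
ADAPTER_ID = "Jinxing1/MQ-Auditor"


def load_lora_adapter(base_model_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Attach the released LoRA adapter to the Llama-2 base LLM."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
    return PeftModel.from_pretrained(base_model, adapter_id)
```

Calling `load_lora_adapter()` downloads both checkpoints, so it needs network access and an accepted Llama-2 license on the account in use. A local directory containing the base model can be passed as `base_model_id` instead.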

## Release Contents

The public weight package should include:

```text
adapter_config.json
adapter_model.safetensors
config.json
model.txt
model_trainable_params.txt
non_lora_trainables.bin
saved_config.json
trainer_state.json
checkpoint-960/
  config.json
  finetune_weights.bin
```

Intermediate epoch checkpoints and TensorBoard logs are not part of the release package.
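
A downloaded copy can be checked against this listing before use. The following is a minimal sketch; `missing_release_files` is a hypothetical helper, and it assumes the last two entries of the listing sit inside `checkpoint-960/` (the listing names `config.json` twice, which suggests a nested copy).

```python
from pathlib import Path

EXPECTED_FILES = [
    "adapter_config.json",
    "adapter_model.safetensors",
    "config.json",
    "model.txt",
    "model_trainable_params.txt",
    "non_lora_trainables.bin",
    "saved_config.json",
    "trainer_state.json",
    # Assumed to live inside checkpoint-960/.
    "checkpoint-960/config.json",
    "checkpoint-960/finetune_weights.bin",
]


def missing_release_files(root):
    """Return the expected release files that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means every expected file is present; any returned names point to an incomplete download.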

## Training Data

The model was trained on MQ-RAVSBench with:

```text
train_test_meta_files/metadata.csv
train_test_meta_files/train_audit_only_filtered.json
```

`null` masks are used during training as empty-mask examples. They are not part of the default/reported test-time evaluation protocol.

## Evaluation

Evaluation is reported on the seen and unseen MQ-RAVSBench test splits:

```text
test_s_image_filtered.json
test_u_image_filtered.json
test_s_video_filtered.json
test_u_video_filtered.json
```

Reported mask types focus on non-empty candidate masks: `perfect`, `cutout`, `erode`, `dilate`, `merge`, and `full_neg`.
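
The split files and reported mask types can be captured in a small lookup, as sketched below. Reading `s` and `u` in the file names as "seen" and "unseen" is an inference from the text, and the record field name `mask_type` is hypothetical, since the JSON schema is not described here.

```python
# File names as listed in this card; the (split, modality) keys are an assumed reading.
TEST_SPLITS = {
    ("seen", "image"): "test_s_image_filtered.json",
    ("unseen", "image"): "test_u_image_filtered.json",
    ("seen", "video"): "test_s_video_filtered.json",
    ("unseen", "video"): "test_u_video_filtered.json",
}

REPORTED_MASK_TYPES = ("perfect", "cutout", "erode", "dilate", "merge", "full_neg")


def count_reported_types(records, type_key="mask_type"):
    """Count records per reported mask type, ignoring others such as `null`."""
    counts = {mask_type: 0 for mask_type in REPORTED_MASK_TYPES}
    for record in records:
        mask_type = record.get(type_key)
        if mask_type in counts:
            counts[mask_type] += 1
    return counts
```

Records with any other type, including the `null` empty-mask examples used in training, are skipped, which mirrors the reported protocol.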

## License

The released MQ-Auditor weights are provided for non-commercial research purposes only under CC BY-NC-SA 4.0-style terms. The weights depend on the Llama-2 base model and other pretrained encoders, so users must also comply with the applicable upstream model licenses and access terms.

## Citation

```bibtex
@article{zhou2026audit,
  title={Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation},
  author={Zhou, Jinxing and Zhou, Yanghao and Wang, Yaoting and Han, Zongyan and Ma, Jiaqi and Ding, Henghui and Anwer, Rao Muhammad and Cholakkal, Hisham},
  journal={arXiv preprint arXiv:2602.03892},
  year={2026}
}
```

Paper: https://arxiv.org/pdf/2602.03892