---
base_model: meta-llama/Llama-2-7b-chat-hf
library_name: peft
license: cc-by-nc-sa-4.0
tags:
- audio
- video
- segmentation
- mask-quality-assessment
- audio-visual-segmentation
- lora
---

# MQ-Auditor HyperLoRA Weights

This repository contains the released MQ-Auditor pretrained weights for reference-free mask quality assessment in language-referred audio-visual segmentation.

The checkpoint corresponds to:

```text
epochs96_lr1e-4_bs4_gradacc8_lora_r32alpha64_pos0.5_ioulosswei0
```
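The run name appears to encode the training hyperparameters. The sketch below splits it into fields with regular expressions. The field names and their meanings (for example, reading `r32alpha64` as LoRA rank 32 and alpha 64, or `ioulosswei0` as an IoU loss weight of 0) are assumptions inferred from the string, not something this card confirms; `saved_config.json` in the release is the place to check the actual values.

```python
import re

RUN_NAME = "epochs96_lr1e-4_bs4_gradacc8_lora_r32alpha64_pos0.5_ioulosswei0"

# Hypothetical field names: each pattern is a guess at how the run name is laid out.
FIELDS = {
    "epochs": (r"epochs(\d+)", int),
    "learning_rate": (r"_lr([0-9.]+(?:e-?\d+)?)", float),
    "batch_size": (r"_bs(\d+)", int),
    "grad_accum_steps": (r"_gradacc(\d+)", int),
    "lora_r": (r"_r(\d+)alpha", int),
    "lora_alpha": (r"alpha(\d+)", int),
    "pos": (r"_pos([0-9.]+)", float),
    "iou_loss_weight": (r"_ioulosswei([0-9.]+)", float),
}


def parse_run_name(name):
    """Return a dict of the fields that could be found in the run name."""
    parsed = {}
    for key, (pattern, cast) in FIELDS.items():
        match = re.search(pattern, name)
        if match:
            parsed[key] = cast(match.group(1))
    return parsed


config = parse_run_name(RUN_NAME)
for key, value in config.items():
    print(f"{key}: {value}")
```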

## Model

MQ-Auditor takes a video clip, audio, a referring expression, a frame, and a candidate segmentation mask, then predicts mask quality attributes such as mask type, IoU, and recommended action.

The released weights are intended to be used with the MQ-Auditor codebase and MQ-RAVSBench dataset. The base LLM checkpoint and external encoders are not included in this package.
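For readers unfamiliar with the IoU attribute, the following sketch computes the standard intersection-over-union between a candidate mask and a reference mask. It assumes the predicted IoU targets this usual definition. The auditor itself is reference-free, so it estimates this value without seeing the reference. The function name `mask_iou` and the convention of returning 1.0 when both masks are empty are illustrative choices, not taken from the MQ-Auditor codebase.

```python
def mask_iou(candidate, reference):
    """Intersection-over-union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for cand_row, ref_row in zip(candidate, reference):
        for c, r in zip(cand_row, ref_row):
            if c and r:
                intersection += 1
            if c or r:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


reference = [
    [0, 1, 1],
    [0, 1, 1],
]
# A shrunken candidate that covers only half of the reference region.
candidate = [
    [0, 1, 0],
    [0, 1, 0],
]
print(mask_iou(candidate, reference))  # 2 shared pixels / 4 pixels in the union
```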

## Release Contents

The public weight package should include:

```text
adapter_config.json
adapter_model.safetensors
config.json
model.txt
model_trainable_params.txt
non_lora_trainables.bin
saved_config.json
trainer_state.json
checkpoint-960/
  config.json
  finetune_weights.bin
```

Intermediate epoch checkpoints and TensorBoard logs are not part of the release package.
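After downloading the package, a quick check that every listed file is present can save a failed load later. The sketch below uses only the standard library; the helper name `missing_release_files` is hypothetical, and the file list is copied from the listing above.

```python
from pathlib import Path

# Relative paths copied from the release listing in this card.
EXPECTED_FILES = [
    "adapter_config.json",
    "adapter_model.safetensors",
    "config.json",
    "model.txt",
    "model_trainable_params.txt",
    "non_lora_trainables.bin",
    "saved_config.json",
    "trainer_state.json",
    "checkpoint-960/config.json",
    "checkpoint-960/finetune_weights.bin",
]


def missing_release_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_release_files("path/to/weights")` returns an empty list when the package is complete, and otherwise the relative paths that still need to be fetched.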

## Training Data

The model was trained on MQ-RAVSBench with:

```text
train_test_meta_files/metadata.csv
train_test_meta_files/train_audit_only_filtered.json
```

`null` masks are used during training as empty-mask examples. They are not part of the default/reported test-time evaluation protocol.

## Evaluation

Evaluation is reported on the seen and unseen MQ-RAVSBench test splits:

```text
test_s_image_filtered.json
test_u_image_filtered.json
test_s_video_filtered.json
test_u_video_filtered.json
```

Reported mask types focus on non-empty candidate masks: `perfect`, `cutout`, `erode`, `dilate`, `merge`, and `full_neg`.
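When reproducing the reported numbers, predictions can be restricted to these six mask types so that `null` examples do not enter the metrics. The sketch below assumes each evaluation record is a dict with a `mask_type` key; that schema is hypothetical and may differ from the actual MQ-RAVSBench JSON files.

```python
from collections import Counter

# The six non-empty mask types named in this card.
REPORTED_MASK_TYPES = ("perfect", "cutout", "erode", "dilate", "merge", "full_neg")


def select_reported(records, key="mask_type"):
    """Keep only records whose mask type is part of the reported protocol."""
    return [rec for rec in records if rec.get(key) in REPORTED_MASK_TYPES]


def count_by_type(records, key="mask_type"):
    """Count records per mask type, e.g. to sanity-check split balance."""
    return Counter(rec[key] for rec in records)


# Toy records under the assumed schema.
records = [
    {"mask_type": "perfect"},
    {"mask_type": "null"},
    {"mask_type": "erode"},
    {"mask_type": "erode"},
]
reported = select_reported(records)
print(count_by_type(reported))
```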

## License

The released MQ-Auditor weights are provided for non-commercial research purposes only under CC BY-NC-SA 4.0-style terms. The weights depend on the Llama-2 base model and other pretrained encoders, so users must also comply with the applicable upstream model licenses and access terms.

## Citation

```bibtex
@article{zhou2026audit,
  title={Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation},
  author={Zhou, Jinxing and Zhou, Yanghao and Wang, Yaoting and Han, Zongyan and Ma, Jiaqi and Ding, Henghui and Anwer, Rao Muhammad and Cholakkal, Hisham},
  journal={arXiv preprint arXiv:2602.03892},
  year={2026}
}
```

Paper: https://arxiv.org/pdf/2602.03892