---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- video-caption-evaluation
- reference-free
- factual-analysis
- vision-language
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- dipta007/ActivityNet-FG-It
arxiv: 2509.16538
---

# VC-Inspector-3B

<a href="https://arxiv.org/abs/2509.16538" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2509.16538-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/collections/dipta007/vc-inspector" target="_blank">
<img alt="Models" src="https://img.shields.io/badge/HuggingFace-Models-orange" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/datasets/dipta007/ActivityNet-FG-It" target="_blank">
<img alt="Dataset" src="https://img.shields.io/badge/HuggingFace-Dataset-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

## Introduction

**VC-Inspector-3B** is a lightweight, open-source large multimodal model (LMM) for **reference-free evaluation of video captions** with a focus on **factual accuracy**. This is the smaller, more efficient variant of VC-Inspector, ideal for resource-constrained environments while still achieving strong performance.

Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible, fact-aware alternative that aligns closely with human judgments.

This model is fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) using LoRA on our synthetic dataset [ActivityNet-FG-It](https://huggingface.co/datasets/dipta007/ActivityNet-FG-It), which contains 44K video-caption pairs with controlled factual errors and quality annotations.
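
To take a quick look at that training data, the dataset can be loaded straight from the Hub. A minimal sketch, assuming the standard `datasets` API and a `train` split (the split name and field layout are assumptions, not guaranteed by this card):

```python
from datasets import load_dataset

# Load the synthetic training set from the Hub (split name assumed to be "train")
ds = load_dataset("dipta007/ActivityNet-FG-It", split="train")

print(ds)     # feature schema and row count
print(ds[0])  # one video-caption pair with its quality annotation
```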

### Key Features

- **Lightweight**: Only 3B parameters; suitable for on-device or resource-constrained deployment
- **Reference-free Evaluation**: Evaluates video captions without requiring ground-truth references
- **Factual Grounding**: Detects factual errors in objects and actions within captions
- **Interpretable Outputs**: Generates quality scores (1-5) with natural language explanations
- **Cross-domain Generalization**: Works on both video and image caption evaluation
- **Fast Inference**: 0.30 seconds per video clip on an A100 GPU

### Model Architecture

VC-Inspector-3B is built on Qwen2.5-VL-3B-Instruct with the following modifications (see the sketch after this list):
- **Vision Encoder**: Frozen (preserves generalization)
- **Visual-Language Projector**: Frozen
- **LLM Component**: Fine-tuned with LoRA (rank=32, alpha=32)
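
A minimal sketch of this setup with `peft`, assuming the vision tower (which contains the projector in Qwen2.5-VL) is exposed as `model.visual` and that LoRA targets the language model's attention projections; the exact target modules used in training are an assumption here:

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto"
)

# Freeze the vision encoder and visual-language projector (both live under model.visual)
for param in model.visual.parameters():
    param.requires_grad = False

# LoRA on the LLM component with the rank/alpha/dropout from this card
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target set
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```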

## Evaluation Results

### Correlation with Human Judgments on VATEX-Eval

| Metric | Type | Kendall's τ_b | Spearman's ρ |
|:-------|:-----|:-------------:|:------------:|
| EMScore | Reference-free | 22.88 | 29.79 |
| CLIPScore | Reference-free | 22.33 | 29.09 |
| ViCLIPScore | Reference-free | 30.92 | 39.86 |
| Qwen2.5-VL-3B (base) | Reference-free | 31.29 | 36.43 |
| G-VEval (GPT-4o) | Reference-free | 39.40 | - |
| **VC-Inspector-3B** | Reference-free | **37.99** | **42.45** |
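
These correlations compare per-caption metric scores against human judgments over the same caption set. A minimal sketch of the computation with `scipy` (the score values below are toy data for illustration only):

```python
from scipy.stats import kendalltau, spearmanr

# Toy values for illustration; the real evaluation uses per-caption scores
metric_scores = [4, 2, 5, 3, 1, 4]  # e.g., VC-Inspector's 1-5 ratings
human_scores = [5, 2, 4, 3, 1, 3]   # human judgments for the same captions

tau, _ = kendalltau(metric_scores, human_scores, variant="b")  # Kendall's tau_b
rho, _ = spearmanr(metric_scores, human_scores)                # Spearman's rho
print(f"tau_b={tau:.4f}, rho={rho:.4f}")
```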

### Cross-domain Evaluation on Image Caption Benchmarks

| Metric | Flickr8K-Expert (τ_b) | Flickr8K-CF (τ_b) |
|:-------|:---------------------:|:-----------------:|
| CLIPScore (ref-free) | 51.10 | 34.40 |
| PAC-S (ref-free) | 53.90 | 36.00 |
| **VC-Inspector-3B** | **59.86** | **39.00** |

### Synthetic Dataset Evaluation

| Dataset | Kendall's τ_b | Spearman's ρ |
|:--------|:-------------:|:------------:|
| ActivityNet-FG-Eval | 49.53 | 62.01 |
| YouCook2-FG-Eval | 44.29 | 55.31 |

### Computational Efficiency

| Metric | Time per clip (A100) |
|:-------|:--------------------:|
| EMScore | 0.42s |
| ViCLIPScore | 0.34s |
| **VC-Inspector-3B** | **0.30s** |

## Requirements

```bash
pip install torch transformers accelerate
pip install "qwen-vl-utils[decord]==0.0.8"
pip install flash-attn --no-build-isolation
```

## Quickstart

### Using Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "dipta007/VCInspector-3B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("dipta007/VCInspector-3B")

# Prepare input
caption = "A man is playing guitar in a field"
prompt = f"""<caption>{caption}</caption>

You are given a video and a caption describing the video content. Please rate the helpfulness, relevance, accuracy, level of details of the caption. The overall score should be on a scale of 1 to 5, where a higher score indicates better overall performance. Please first output a single line containing only one integer indicating the score. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias. STRICTLY FOLLOW THE FORMAT."""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4", "max_pixels": 360 * 420, "fps": 1.0},
            {"type": "text", "text": prompt},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

### Example Output

```
4
The caption does not accurately capture the video content. For example, the objects (guitar) are incorrect.
```

### Using with ms-swift (vLLM backend)

```python
import os

from swift.llm import VllmEngine, InferRequest, RequestConfig

os.environ["VIDEO_MAX_PIXELS"] = "50176"
os.environ["FPS_MAX_FRAMES"] = "12"

engine = VllmEngine(
    "dipta007/VCInspector-3B",
    max_model_len=32768,
    limit_mm_per_prompt={"image": 32},
)

# Prepare request; `prompt` is the evaluation prompt from the Transformers example above
frames = ["frame1.jpg", "frame2.jpg", ...]  # pre-extracted video frames
image_tags = "<image>" * len(frames)  # ms-swift expects one <image> tag per image
request = InferRequest(
    messages=[{"role": "user", "content": f"{image_tags}\n{prompt}"}],
    images=frames,
)
config = RequestConfig(max_tokens=256, temperature=0.0)
response = engine.infer([request], config)
print(response[0].choices[0].message.content)
```
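
The frame files passed to `images` must be extracted beforehand. One possible sketch with OpenCV (`opencv-python` assumed installed; the path and file names are illustrative), sampling roughly one frame per second and capping at 12 frames to match `FPS_MAX_FRAMES` above:

```python
import cv2

cap = cv2.VideoCapture("path/to/video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
step = max(1, round(fps))                # ~1 sampled frame per second

frames, idx = [], 0
while len(frames) < 12:                  # cap to match FPS_MAX_FRAMES
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        path = f"frame{len(frames) + 1}.jpg"
        cv2.imwrite(path, frame)
        frames.append(path)
    idx += 1
cap.release()
```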

## Output Format

VC-Inspector outputs two components (a parsing sketch follows the list):

1. **Quality Score** (Line 1): An integer from 1 to 5
   - 5: Caption is accurate and comprehensive
   - 4: Minor factual errors
   - 3: Moderate factual errors
   - 2: Significant factual errors
   - 1: Major factual errors or completely incorrect

2. **Explanation** (Line 2+): Natural language explanation identifying:
   - Incorrect objects (e.g., "guitar" instead of "violin")
   - Incorrect actions (e.g., "running" instead of "walking")
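
Downstream code can split the two components on the first newline. A minimal sketch, using `output_text` from the Quickstart above (a production parser might add a regex fallback in case the model deviates from the format):

```python
def parse_inspector_output(text: str) -> tuple[int, str]:
    """Split VC-Inspector output into (score, explanation)."""
    first, _, rest = text.strip().partition("\n")
    score = int(first.strip())  # line 1: integer score in [1, 5]
    explanation = rest.strip()  # line 2+: natural language explanation
    return score, explanation

score, explanation = parse_inspector_output(output_text[0])
print(score, "->", explanation)
```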

## Training Details

| Hyperparameter | Value |
|:---------------|:------|
| Base Model | Qwen2.5-VL-3B-Instruct |
| Training Data | ActivityNet-FG-It (44K samples) |
| Epochs | 1 |
| Global Batch Size | 128 |
| Learning Rate | 1e-4 |
| LR Scheduler | Cosine (min: 1e-5) |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Number of Frames | 32 |
| Training Time | ~32 GPU hours (A100) |
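
Training was done with ms-swift (see Acknowledgements). Purely as an illustration, the optimizer settings in the table could map to plain `transformers` arguments roughly as follows; the per-device/accumulation split producing the global batch of 128 is an assumption:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="vc-inspector-3b-lora",
    num_train_epochs=1,
    learning_rate=1e-4,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 1e-5},   # cosine decay floored at 1e-5
    per_device_train_batch_size=4,          # 4 x 4 accumulation x 8 GPUs = 128 global
    gradient_accumulation_steps=4,
    bf16=True,
)
```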

## Ablation Studies

### Impact of Explanation Supervision

| Setting | Kendall's τ_b | Spearman's ρ |
|:--------|:-------------:|:------------:|
| Without Explanations | 34.29 | 38.18 |
| **With Explanations** | **37.99** | **42.45** |

### Data Synthesis Strategy

| Strategy | Kendall's τ_b | Spearman's ρ |
|:---------|:-------------:|:------------:|
| Change objects only | 36.40 | 41.20 |
| Change actions only | 33.23 | 39.63 |
| **Change both (Ours)** | **37.99** | **42.45** |

## When to Use VC-Inspector-3B vs 7B

| Use Case | Recommended Model |
|:---------|:------------------|
| Resource-constrained environments | **3B** |
| On-device deployment | **3B** |
| Batch processing large datasets | **3B** |
| Maximum accuracy required | 7B |
| Research benchmarking | 7B |

## Limitations

- Primarily targets object and action correctness; attributes, spatial relationships, and fine-grained temporal ordering are not explicitly modeled
- Training relies on synthetically generated captions and pseudo-scores
- Slightly lower performance than the 7B variant on challenging cases

## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{dipta2025advancingreferencefreeevaluationvideo,
  title={Advancing Reference-free Evaluation of Video Captions with Factual Analysis},
  author={Shubhashis Roy Dipta and Tz-Ying Wu and Subarna Tripathi},
  year={2025},
  eprint={2509.16538},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.16538},
}
```

## Acknowledgements

This work builds upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) and uses [ms-swift](https://github.com/modelscope/ms-swift) for training.