dipta007 committed on
Commit d063cdb · verified · 1 Parent(s): f966cc7

Create README.md

Files changed (1): README.md (+230 −0)

README.md ADDED
---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- video-caption-evaluation
- reference-free
- factual-analysis
- vision-language
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- dipta007/ActivityNet-FG-It
arxiv: 2509.16538
---

# VC-Inspector-7B

<a href="https://arxiv.org/abs/2509.16538" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2509.16538-b31b1b.svg" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/collections/dipta007/vc-inspector" target="_blank">
<img alt="Models" src="https://img.shields.io/badge/HuggingFace-Models-orange" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/datasets/dipta007/ActivityNet-FG-It" target="_blank">
<img alt="Dataset" src="https://img.shields.io/badge/HuggingFace-Dataset-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

## Introduction

**VC-Inspector-7B** is a lightweight, open-source large multimodal model (LMM) for **reference-free evaluation of video captions** with a focus on **factual accuracy**. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible, fact-aware alternative that aligns closely with human judgments.

This model is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using LoRA on our synthetic dataset [ActivityNet-FG-It](https://huggingface.co/datasets/dipta007/ActivityNet-FG-It), which contains 44K video-caption pairs with controlled factual errors and quality annotations.
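
To browse the training data, the dataset can be loaded with the Hugging Face `datasets` library. A minimal sketch (the split name and schema are assumptions; check the dataset card for the exact fields):

```python
from datasets import load_dataset

# Load the synthetic training set used to fine-tune VC-Inspector.
# The "train" split name is an assumption -- see the dataset card.
ds = load_dataset("dipta007/ActivityNet-FG-It", split="train")
print(ds[0])  # inspect one video-caption pair with its quality annotation
```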

### Key Features

- **Reference-free Evaluation**: Evaluates video captions without requiring ground-truth references
- **Factual Grounding**: Detects factual errors in objects and actions within captions
- **Interpretable Outputs**: Generates quality scores (1-5) with natural language explanations
- **Cross-domain Generalization**: Works on both video and image caption evaluation
- **State-of-the-art Performance**: Outperforms GPT-4o-based methods on VATEX-Eval

### Model Architecture

VC-Inspector-7B is built on Qwen2.5-VL-7B-Instruct with the following modifications (a configuration sketch follows the list):
- **Vision Encoder**: Frozen (preserves generalization)
- **Visual-Language Projector**: Frozen
- **LLM Component**: Fine-tuned with LoRA (rank=32, alpha=32)
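
For reference, the sketch below expresses this freezing/LoRA setup in PEFT terms. It is illustrative only: training was actually done with ms-swift, and the `target_modules` choice here is an assumption.

```python
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto"
)

# Freeze the vision encoder and the visual-language projector.
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False

# LoRA on the LLM component (rank=32, alpha=32, dropout=0.05 as in the card).
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # an assumption
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```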

## Evaluation Results

### Correlation with Human Judgments on VATEX-Eval

| Metric | Type | Kendall's τ_b | Spearman's ρ |
|:-------|:-----|:-------------:|:------------:|
| EMScore | Reference-free | 22.88 | 29.79 |
| CLIPScore | Reference-free | 22.33 | 29.09 |
| ViCLIPScore | Reference-free | 30.92 | 39.86 |
| G-VEval (GPT-4o) | Reference-free | 39.40 | - |
| Qwen2.5-VL-7B (base) | Reference-free | 34.70 | 39.40 |
| **VC-Inspector-7B** | Reference-free | **42.58** | **45.99** |

### Cross-domain Evaluation on Image Caption Benchmarks

| Metric | Flickr8K-Expert (τ_b) | Flickr8K-CF (τ_b) |
|:-------|:---------------------:|:-----------------:|
| CLIPScore (ref-free) | 51.10 | 34.40 |
| PAC-S (ref-free) | 53.90 | 36.00 |
| **VC-Inspector-7B** | **63.43** | **45.97** |

### Synthetic Dataset Evaluation

| Dataset | Kendall's τ_b | Spearman's ρ |
|:--------|:-------------:|:------------:|
| ActivityNet-FG-Eval | 49.53 | 62.01 |
| YouCook2-FG-Eval | 44.29 | 55.31 |
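
For context, correlations like those above can be computed with standard SciPy routines once per-caption model scores and human judgments are collected. A minimal sketch with toy data (not the paper's numbers):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical scores for five captions: model ratings vs. human judgments.
model_scores = [4, 2, 5, 3, 1]
human_scores = [5, 2, 4, 3, 2]

tau_b, _ = kendalltau(model_scores, human_scores)  # tau-b variant handles ties
rho, _ = spearmanr(model_scores, human_scores)
print(f"Kendall's tau_b: {tau_b:.4f}, Spearman's rho: {rho:.4f}")
```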

## Requirements

```bash
pip install torch transformers accelerate
pip install "qwen-vl-utils[decord]==0.0.8"
pip install flash-attn --no-build-isolation
```

## Quickstart

### Using Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "dipta007/VCInspector-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("dipta007/VCInspector-7B")

# Prepare input
caption = "A man is playing guitar in a field"
prompt = f"""<caption>{caption}</caption>

You are given a video and a caption describing the video content. Please rate the helpfulness, relevance, accuracy, level of details of the caption. The overall score should be on a scale of 1 to 5, where a higher score indicates better overall performance. Please first output a single line containing only one integer indicating the score. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias. STRICTLY FOLLOW THE FORMAT."""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/video.mp4", "max_pixels": 360 * 420, "fps": 1.0},
            {"type": "text", "text": prompt},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
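
Since VC-Inspector also generalizes to image captions (see the cross-domain results above), the same pipeline can score an image caption: only the `messages` construction changes. The image path below is a placeholder.

```python
# Image-caption evaluation: swap the video entry for an image entry and reuse
# the same prompt, processing, and generation steps as in the video example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]
```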

### Example Output

```
2
The caption does not accurately capture the video content. For example, the objects (guitar) are incorrect.
```

### Using with ms-swift (vLLM backend)

```python
from swift.llm import VllmEngine, InferRequest, RequestConfig
import os

os.environ["VIDEO_MAX_PIXELS"] = "50176"
os.environ["FPS_MAX_FRAMES"] = "12"

engine = VllmEngine(
    "dipta007/VCInspector-7B",
    max_model_len=32768,
    limit_mm_per_prompt={"image": 32},
)

# Prepare request. `prompt` is the same evaluation prompt as in the
# Transformers example above.
request = InferRequest(
    messages=[{"role": "user", "content": f"<image>\n{prompt}"}],
    images=["frame1.jpg", "frame2.jpg", ...],  # video frames sampled beforehand (see the helper below)
)
config = RequestConfig(max_tokens=256, temperature=0.0)
response = engine.infer([request], config)
print(response[0].choices[0].message.content)
```
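
The ms-swift example expects frames that were already extracted from the video. One simple way to sample frames uniformly is with OpenCV; this helper is illustrative and not part of the released code:

```python
import cv2

def extract_frames(video_path: str, num_frames: int = 12) -> list[str]:
    """Uniformly sample `num_frames` frames from a video and save them as JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    paths = []
    for j, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame index
        ok, frame = cap.read()
        if not ok:
            continue
        path = f"frame{j + 1}.jpg"
        cv2.imwrite(path, frame)
        paths.append(path)
    cap.release()
    return paths
```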

## Output Format

VC-Inspector outputs two components (see the parsing sketch after the list):

1. **Quality Score** (Line 1): Integer from 1-5
   - 5: Caption is accurate and comprehensive
   - 4: Minor factual errors
   - 3: Moderate factual errors
   - 2: Significant factual errors
   - 1: Major factual errors or completely incorrect

2. **Explanation** (Line 2+): Natural language explanation identifying:
   - Incorrect objects (e.g., "guitar" instead of "violin")
   - Incorrect actions (e.g., "running" instead of "walking")

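Because the format is fixed, the raw output is easy to post-process. A minimal parsing sketch (the helper name is ours; `output_text` comes from the Quickstart example):

```python
def parse_inspector_output(raw: str) -> tuple[int, str]:
    """Split VC-Inspector's output into (score, explanation)."""
    first_line, _, rest = raw.strip().partition("\n")
    score = int(first_line.strip())  # integer 1-5 on the first line
    explanation = rest.strip()       # free-text rationale on the lines after
    return score, explanation

score, explanation = parse_inspector_output(output_text[0])
```
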
## Training Details

| Hyperparameter | Value |
|:---------------|:------|
| Base Model | Qwen2.5-VL-7B-Instruct |
| Training Data | ActivityNet-FG-It (44K samples) |
| Epochs | 1 |
| Global Batch Size | 128 |
| Learning Rate | 1e-4 |
| LR Scheduler | Cosine (min: 1e-5) |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Number of Frames | 32 |
| Training Time | ~32 GPU hours (A100) |

## Limitations

- Primarily targets object and action correctness; attributes, spatial relationships, and fine-grained temporal ordering are not explicitly modeled
- Training relies on synthetically generated captions and pseudo-scores
- Higher computational cost than embedding-based metrics (though far lighter than GPT-4o-based evaluation)

## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{dipta2025advancingreferencefreeevaluationvideo,
  title={Advancing Reference-free Evaluation of Video Captions with Factual Analysis},
  author={Shubhashis Roy Dipta and Tz-Ying Wu and Subarna Tripathi},
  year={2025},
  eprint={2509.16538},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.16538},
}
```

## Acknowledgements

This work builds upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) and uses [ms-swift](https://github.com/modelscope/ms-swift) for training.