<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning">
<meta name="keywords" content="RISE, VLM, Vision-Language Models, Image Annotation, Chain of Thought, CoT">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="#" target="_blank">Suhang Hu</a><sup>*</sup>,</span>
<span class="author-block">
<a href="#" target="_blank">Wei Hu</a><sup>†</sup>,</span>
<span class="author-block">
<a href="#" target="_blank">Yuhang Su</a>,
</span>
<span class="author-block">
<a href="#" target="_blank">Fan Zhang</a>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">Beijing University of Chemical Technology</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="static/images/RISE.pdf" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<a href="https://arxiv.org/abs/2508.13229" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/HSH55/RISE" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training.
</p>
<p>
We introduce <strong>RISE</strong> (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the <strong>Reason</strong> stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The <strong>Inspire</strong> and <strong>Strengthen</strong> stage (RISE-R1) leverages a high-quality CoT subset for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations.
</p>
<p>
Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
</p>
</div>
</div>
</div>
<!--/ Abstract. -->
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">RISE Framework</h2>
<div class="columns is-centered">
<div class="column is-full-width">
<h3 class="title is-4">Two-Stage Approach</h3>
<div class="content has-text-justified">
<p>
RISE operates through two stages to enhance VLM reasoning capabilities for image annotation tasks:
</p>
<h4 class="title is-5">1. RISE-CoT: Closed-Loop Reasoning Generation</h4>
<p>
This stage generates high-quality, visually grounded Chains of Thought (CoTs) for image-annotation pairs in a self-supervised manner. The process involves three steps, sketched in code after the list:
</p>
<ul>
<li><strong>Reasoning Generation:</strong> The VLM produces a CoT that justifies the annotation without leaking its specifics</li>
<li><strong>Annotation Reconstruction:</strong> The VLM reconstructs the annotation from the generated CoT alone</li>
<li><strong>Consistency Validation:</strong> A reward function scores CoT quality by how accurately the annotation is reconstructed</li>
</ul>
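<p>
A minimal Python sketch of this closed loop, assuming hypothetical callables (<code>generate_cot</code>, <code>reconstruct_annotation</code>) for the two prompted VLM passes; the paper's actual reward implementation may differ:
</p>
<pre><code># Sketch of the RISE-CoT "annotation-reasoning-annotation" loop.
# generate_cot / reconstruct_annotation stand in for prompted VLM calls.

def leaks_annotation(cot: str, annotation: str) -> bool:
    # Leakage prevention: the CoT must not quote the label verbatim.
    return annotation.lower() in cot.lower()

def rise_cot_reward(generate_cot, reconstruct_annotation, image, annotation: str):
    # 1. Reasoning Generation: a CoT that justifies the annotation.
    cot = generate_cot(image, annotation)
    if leaks_annotation(cot, annotation):
        return 0.0, cot  # direct leakage earns zero reward
    # 2. Annotation Reconstruction: re-derive the annotation from the CoT alone.
    reconstructed = reconstruct_annotation(image, cot)
    # 3. Consistency Validation: reward reflects reconstruction accuracy.
    reward = 1.0 if reconstructed == annotation else 0.0
    return reward, cot</code></pre>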
<div class="content has-text-centered">
<img src="./static/images/Rise-cot.png" alt="RISE Framework Diagram" style="width: 100%;">
<p class="has-text-centered">Figure 1: RISE-CoT framework</p>
</div>
<h4 class="title is-5">2. RISE-R1: Training VLM for Enhanced CoTs</h4>
<p>
This stage trains the VLM to produce structured "think-answer" outputs in two phases, sketched in code after the list:
</p>
<ul>
<li><strong>Inspire (SFT):</strong> Supervised fine-tuning on the high-quality CoT subset</li>
<li><strong>Strengthen (RFT):</strong> Reinforcement fine-tuning on the full dataset to optimize task-specific outputs</li>
</ul>
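<p>
A minimal sketch of this two-phase schedule, assuming hypothetical <code>sft</code>/<code>rft</code> training methods and the per-example rewards produced by RISE-CoT:
</p>
<pre><code># Sketch of the RISE-R1 schedule (APIs are illustrative, not the paper's code).

def rise_r1_train(model, dataset, cot_rewards, reward_fn, tau=0.75):
    # Inspire: supervised fine-tuning on the high-quality CoT subset,
    # i.e. examples whose RISE-CoT reward passed the threshold tau.
    sft_subset = [ex for ex, r in zip(dataset, cot_rewards) if r >= tau]
    model.sft(sft_subset)
    # Strengthen: reinforcement fine-tuning on the full dataset to
    # optimize task-specific "think-answer" outputs.
    model.rft(dataset, reward_fn=reward_fn)
    return model</code></pre>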
</div>
<div class="content has-text-centered">
<img src="./static/images/Rise-r1.png" alt="RISE Framework Diagram" style="width: 100%;">
<p class="has-text-centered">Figure 2: RISE-R1 framework</p>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">Experiments & Results</h2>
<div class="columns is-centered">
<div class="column">
<h3 class="title is-4">Datasets</h3>
<div class="content has-text-justified">
<p>We evaluated RISE on four image annotation datasets of varying complexity:</p>
<ul>
<li><strong>Emotion6:</strong> Emotion classification with probability distributions</li>
<li><strong>LISA:</strong> Context-driven object detection</li>
<li><strong>ImageNet-Sub:</strong> Simple classification task</li>
<li><strong>COCO-Sub:</strong> Multi-target object detection</li>
</ul>
</div>
</div>
<div class="column">
<h3 class="title is-4">Key Results</h3>
<div class="content has-text-justified">
<p>RISE demonstrates superior performance across both complex and simple tasks:</p>
<ul>
<li>Outperforms SFT and Visual-RFT on Emotion6 and LISA</li>
<li>Achieves robust performance on ImageNet-Sub and COCO-Sub</li>
<li>Generates high-quality, interpretable Chains of Thought</li>
<li>Provides self-supervised solution without manual CoT annotation</li>
</ul>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">Ablation Studies</h2>
<div class="content has-text-justified">
<p>Our ablation studies confirm the importance of key RISE components:</p>
<ul>
<li><strong>CoT Quality:</strong> RISE-CoT generates higher-quality CoTs than the Base-Model and GPT-4o</li>
<li><strong>SFT Initialization:</strong> SFT on the high-quality CoT subset is crucial for RFT success</li>
<li><strong>Reward Function:</strong> The full reward function (with leakage prevention and format constraints) achieves the best performance; a sketch of its composition follows this list</li>
<li><strong>Threshold Selection:</strong> τ=0.75 optimally balances CoT quality and dataset size</li>
</ul>
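<p>
As a rough illustration of how such a full reward might compose these components (the gating structure and the <code>&lt;think&gt;</code>/<code>&lt;answer&gt;</code> tag format are assumptions, not taken from the paper):
</p>
<pre><code># Illustrative reward composition; the structure here is an assumption.
import re

def full_reward(consistency: float, output: str, annotation: str) -> float:
    # Format constraint: the output must follow the structured think-answer format.
    m = re.search(r"&lt;think&gt;(.*)&lt;/think&gt;\s*&lt;answer&gt;.*&lt;/answer&gt;", output, re.S)
    if m is None:
        return 0.0
    # Leakage prevention: the reasoning segment must not quote the label verbatim.
    if annotation.lower() in m.group(1).lower():
        return 0.0
    # Otherwise the reward is the annotation-reconstruction consistency score.
    return consistency</code></pre>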
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">Conclusion</h2>
<div class="content has-text-justified">
<p>
We introduced RISE, a novel two-stage framework that significantly enhances VLMs for complex image annotation tasks.
RISE autonomously generates high-quality CoTs by verifying their ability to reconstruct original annotations, then uses
these CoTs to train VLMs to produce accurate and interpretable "think-answer" outputs directly from images.
</p>
<p>
Through its verifiable, self-supervised CoT generation, RISE improves annotation accuracy and interpretability while
uniquely enabling implicit evaluation and refinement of dataset annotation quality. This framework effectively boosts
the reasoning capabilities of lower-capacity VLMs across various image annotation tasks, allowing them to perform on par with larger models.
</p>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@article{hu2025rise,
  title={RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning},
  author={Hu, Suhang and Hu, Wei and Su, Yuhang and Zhang, Fan},
  journal={arXiv preprint arXiv:2508.13229},
  year={2025}
}</code></pre>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<p>
This website is licensed under a <a rel="license" target="_blank"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</footer>
</body>
</html> |