<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning">
  <meta name="keywords" content="RISE, VLM, Vision-Language Models, Image Annotation, Chain of Thought, CoT">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</title>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <link rel="icon" href="./static/images/favicon.svg">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>

<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</h1>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="#" target="_blank">Suhang Hu</a><sup>*</sup>,</span>
            <span class="author-block">
              <a href="#" target="_blank">Wei Hu</a><sup>†</sup>,</span>
            <span class="author-block">
              <a href="#" target="_blank">Yuhang Su</a>,
            </span>
            <span class="author-block">
              <a href="#" target="_blank">Fan Zhang</a>
            </span>
          </div>

          <div class="is-size-5 publication-authors">
            <span class="author-block">Beijing University of Chemical Technology</span>
          </div>

          <div class="column has-text-centered">
            <div class="publication-links">
              <span class="link-block">
                <a href="static/images/RISE.pdf" target="_blank"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fas fa-file-pdf"></i>
                  </span>
                  <span>Paper</span>
                </a>
              </span>
              <span class="link-block">
                <a href="https://arxiv.org/abs/2508.13229" target="_blank"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="ai ai-arxiv"></i>
                  </span>
                  <span>arXiv</span>
                </a>
              </span>
              <span class="link-block">
                <a href="https://github.com/HSH55/RISE" target="_blank"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fab fa-github"></i>
                  </span>
                  <span>Code</span>
                </a>
              </span>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
            Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring the underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training.
          </p>
          <p>
            We introduce <strong>RISE</strong> (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the <strong>Reason</strong> stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct the original annotations without direct leakage. The <strong>Inspire</strong> and <strong>Strengthen</strong> stage (RISE-R1) leverages a high-quality CoT subset for supervised fine-tuning, followed by reinforcement fine-tuning, to produce interpretable reasoning and accurate annotations.
          </p>
          <p>
            Evaluated on both complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
          </p>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">RISE Framework</h2>

    <div class="columns is-centered">
      <div class="column is-full-width">
        <h3 class="title is-4">Two-Stage Approach</h3>

        <div class="content has-text-justified">
          <p>
            RISE operates in two stages to enhance VLM reasoning capabilities for image annotation tasks:
          </p>

          <h4 class="title is-5">1. RISE-CoT: Closed-Loop Reasoning Generation</h4>
          <p>
            This stage generates high-quality, visually grounded Chains of Thought (CoTs) for image-annotation pairs in a self-supervised manner (a schematic sketch follows Figure 1). The process involves:
          </p>
          <ul>
            <li><strong>Reasoning Generation:</strong> the VLM produces a CoT that justifies the annotation without leaking its specifics</li>
            <li><strong>Annotation Reconstruction:</strong> the VLM reconstructs the annotation from the generated CoT alone</li>
            <li><strong>Consistency Validation:</strong> a reward function scores CoT quality by reconstruction accuracy</li>
          </ul>

          <div class="content has-text-centered">
            <img src="./static/images/Rise-cot.png" alt="RISE-CoT framework diagram" style="width: 100%;">
            <p class="has-text-centered">Figure 1: RISE-CoT framework</p>
          </div>
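
          <p>
            To make the closed loop concrete, here is a minimal, illustrative sketch of the consistency reward, not the paper's exact implementation: <code>generate_cot</code> and <code>reconstruct</code> are hypothetical callables standing in for VLM inference, and exact-match scoring stands in for the task-specific reward (the full reward also adds format constraints).
          </p>
          <pre><code># Illustrative sketch only. The two callables are hypothetical stand-ins
# for VLM inference; exact match stands in for task-specific scoring.
from typing import Callable

def consistency_reward(
    image: bytes,
    annotation: str,
    generate_cot: Callable[[bytes, str], str],   # (image, annotation) -> CoT
    reconstruct: Callable[[bytes, str], str],    # (image, CoT) -> annotation
) -> float:
    """Score a CoT by whether the annotation can be re-derived from it."""
    cot = generate_cot(image, annotation)

    # Leakage prevention: a CoT that quotes the label verbatim earns no
    # reward, since reconstructing the annotation from it would be trivial.
    if annotation.lower() in cot.lower():
        return 0.0

    # Reconstruction accuracy: high reward only if the original annotation
    # is recovered from the CoT without having been leaked into it.
    reconstructed = reconstruct(image, cot)
    return 1.0 if reconstructed.strip().lower() == annotation.strip().lower() else 0.0</code></pre>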
|
|
|
|
|
<h4 class="title is-5">2. RISE-R1: Training VLM for Enhanced CoTs</h4> |
|
|
<p> |
|
|
This stage trains the VLM to produce structured "think-answer" outputs: |
|
|
</p> |
|
|
<ul> |
|
|
<li><strong>Inspire (SFT):</strong> Supervised fine-tuning on high-quality CoT subset</li> |
|
|
<li><strong>Strengthen (RFT):</strong> Reinforcement fine-tuning on full dataset to optimize task-specific outputs</li> |
|
|
</ul> |
|
|
</div> |
|
|
|
|
|
<div class="content has-text-centered"> |
|
|
<img src="./static/images/Rise-r1.png" alt="RISE Framework Diagram" style="width: 100%;"> |
|
|
<p class="has-text-centered">Figure 2: RISE-R1 framework</p> |
|
|
</div> |
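
        <div class="content has-text-justified">
          <p>
            The schedule below is a minimal sketch of this two-phase recipe under stated assumptions, not the authors' exact training code: <code>sft_step</code> and <code>rft_step</code> are hypothetical trainer hooks, and the 0.75 reward threshold is the value reported in the ablation study.
          </p>
          <pre><code># Illustrative sketch; sft_step / rft_step are hypothetical trainer hooks.
TAU = 0.75  # reward threshold from the ablation study

def rise_r1(dataset, rewards, sft_step, rft_step, sft_epochs=1, rft_epochs=1):
    # Inspire: supervised fine-tuning on the verified high-quality CoT
    # subset, teaching the "think-answer" output format by imitation.
    subset = [ex for ex, r in zip(dataset, rewards) if r >= TAU]
    for _ in range(sft_epochs):
        for example in subset:
            sft_step(example)

    # Strengthen: reinforcement fine-tuning on the full dataset to optimize
    # task-specific rewards beyond what imitation alone provides.
    for _ in range(rft_epochs):
        for example in dataset:
            rft_step(example)</code></pre>
        </div>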
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">Experiments &amp; Results</h2>

    <div class="columns is-centered">
      <div class="column">
        <h3 class="title is-4">Datasets</h3>
        <div class="content has-text-justified">
          <p>We evaluated RISE on four image annotation datasets of varying complexity:</p>
          <ul>
            <li><strong>Emotion6:</strong> emotion classification with probability distributions</li>
            <li><strong>LISA:</strong> context-driven object detection</li>
            <li><strong>ImageNet-Sub:</strong> simple classification</li>
            <li><strong>COCO-Sub:</strong> multi-target object detection</li>
          </ul>
        </div>
      </div>

      <div class="column">
        <h3 class="title is-4">Key Results</h3>
        <div class="content has-text-justified">
          <p>RISE demonstrates superior performance across both complex and simple tasks:</p>
          <ul>
            <li>Outperforms SFT and Visual-RFT on Emotion6 and LISA</li>
            <li>Achieves robust performance on ImageNet-Sub and COCO-Sub</li>
            <li>Generates high-quality, interpretable Chains of Thought</li>
            <li>Provides a self-supervised solution without manual CoT annotation</li>
          </ul>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">Ablation Studies</h2>

    <div class="content has-text-justified">
      <p>Our ablation studies confirm the importance of key RISE components:</p>
      <ul>
        <li><strong>CoT Quality:</strong> RISE-CoT generates higher-quality CoTs than Base-Model and GPT-4o</li>
        <li><strong>SFT Initialization:</strong> SFT on the high-quality CoT subset is crucial for RFT success</li>
        <li><strong>Reward Function:</strong> the full reward function (with leakage prevention and format constraints) achieves the best performance</li>
        <li><strong>Threshold Selection:</strong> τ = 0.75 optimally balances CoT quality against dataset size</li>
      </ul>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">Conclusion</h2>

    <div class="content has-text-justified">
      <p>
        We introduced RISE, a novel two-stage framework that significantly enhances VLMs for complex image annotation tasks.
        RISE autonomously generates high-quality CoTs by verifying their ability to reconstruct the original annotations, then uses
        these CoTs to train VLMs to produce accurate and interpretable "think-answer" outputs directly from images.
      </p>
      <p>
        Through its verifiable, self-supervised CoT generation, RISE improves annotation accuracy and interpretability while
        uniquely enabling implicit evaluation and refinement of dataset annotation quality. The framework effectively boosts
        the reasoning capabilities of lower-capacity VLMs across a range of image annotation tasks, allowing them to perform on par with larger models.
      </p>
    </div>
  </div>
</section>

<section class="section" id="BibTeX">
  <div class="container is-max-desktop content">
    <h2 class="title">BibTeX</h2>
    <pre><code>@article{hu2025rise,
  title={RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning},
  author={Hu, Suhang and Hu, Wei and Su, Yuhang and Zhang, Fan},
  journal={arXiv preprint arXiv:2508.13229},
  year={2025}
}</code></pre>
  </div>
</section>

<footer class="footer">
  <div class="container">
    <div class="content has-text-centered">
      <p>
        This website is licensed under a <a rel="license" target="_blank"
        href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
        Commons Attribution-ShareAlike 4.0 International License</a>.
      </p>
    </div>
  </div>
</footer>

</body>
</html>