<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description"
content="RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning">
<meta name="keywords" content="RISE, VLM, Vision-Language Models, Image Annotation, Chain of Thought, CoT">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/favicon.svg">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="#" target="_blank">Suhang Hu</a><sup>*</sup>,</span>
<span class="author-block">
<a href="#" target="_blank">Wei Hu</a><sup></sup>,</span>
<span class="author-block">
<a href="#" target="_blank">Yuhang Su</a>,
</span>
<span class="author-block">
<a href="#" target="_blank">Fan Zhang</a>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">Beijing University of Chemical Technology</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="static/images/RISE.pdf" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<a href="https://arxiv.org/abs/2508.13229" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/HSH55/RISE" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training.
</p>
<p>
We introduce <strong>RISE</strong> (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the <strong>Reason</strong> stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The <strong>Inspire</strong> and <strong>Strengthen</strong> stage (RISE-R1) leverages a high-quality CoT subset for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations.
</p>
<p>
Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
</p>
</div>
</div>
</div>
<!--/ Abstract. -->
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">RISE Framework</h2>
<div class="columns is-centered">
<div class="column is-full-width">
<h3 class="title is-4">Two-Stage Approach</h3>
<div class="content has-text-justified">
<p>
RISE operates through two stages to enhance VLM reasoning capabilities for image annotation tasks:
</p>
<h4 class="title is-5">1. RISE-CoT: Closed-Loop Reasoning Generation</h4>
<p>
This stage generates high-quality, visually grounded Chains of Thought (CoTs) for image-annotation pairs in a self-supervised manner. The process involves three steps, sketched in code after Figure 1:
</p>
<ul>
<li><strong>Reasoning Generation:</strong> The VLM produces a CoT that justifies the annotation without leaking its specifics</li>
<li><strong>Annotation Reconstruction:</strong> The VLM reconstructs the annotation from the generated CoT alone</li>
<li><strong>Consistency Validation:</strong> A reward function scores each CoT by how accurately the annotation can be reconstructed from it</li>
</ul>
<div class="content has-text-centered">
<img src="./static/images/Rise-cot.png" alt="RISE Framework Diagram" style="width: 100%;">
<p class="has-text-centered">Figure 1: RISE-CoT framework</p>
</div>
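<p>
A minimal sketch of this closed loop is shown below, assuming a generic <code>vlm.generate(image, prompt)</code> interface. All helper names (<code>match_score</code>, <code>leaks_annotation</code>, <code>rise_cot_step</code>) are hypothetical stand-ins, not the released implementation; see the code repository for the actual training logic.
</p>
<pre><code># Illustrative sketch of the RISE-CoT "annotation-reasoning-annotation"
# loop. All names are hypothetical, not the released implementation.

def match_score(pred: str, gold: str) -&gt; float:
    """Task-specific agreement metric; exact match as a stand-in."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def leaks_annotation(cot: str, gold: str) -&gt; bool:
    """Crude leakage check: the CoT must not quote the label verbatim."""
    return gold.strip().lower() in cot.lower()

def rise_cot_step(vlm, image, annotation):
    # 1. Reasoning generation: justify the given annotation from visual
    #    evidence, without restating the annotation itself.
    cot = vlm.generate(image, prompt=(
        "Explain, step by step, the visual evidence supporting "
        f"this annotation: {annotation}"))

    # 2. Annotation reconstruction: a second pass sees only the CoT,
    #    never the original annotation.
    reconstructed = vlm.generate(image, prompt=(
        f"Given this reasoning:\n{cot}\nState the annotation it implies."))

    # 3. Consistency validation: reward reconstruction accuracy and
    #    zero out CoTs that leak the answer verbatim.
    reward = (0.0 if leaks_annotation(cot, annotation)
              else match_score(reconstructed, annotation))
    return cot, reward</code></pre>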
<h4 class="title is-5">2. RISE-R1: Training VLM for Enhanced CoTs</h4>
<p>
This stage trains the VLM to produce structured "think-answer" outputs (a code sketch follows Figure 2):
</p>
<ul>
<li><strong>Inspire (SFT):</strong> Supervised fine-tuning on the high-quality CoT subset</li>
<li><strong>Strengthen (RFT):</strong> Reinforcement fine-tuning on the full dataset to optimize task-specific outputs</li>
</ul>
</div>
<div class="content has-text-centered">
<img src="./static/images/Rise-r1.png" alt="RISE Framework Diagram" style="width: 100%;">
<p class="has-text-centered">Figure 2: RISE-R1 framework</p>
</div>
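<div class="content has-text-justified">
<p>
The two phases can be outlined as below. This is a hedged sketch, not the paper's code: the SFT and RFT trainers are injected as placeholders (<code>sft_fn</code>, <code>rft_fn</code>), and the think/answer template and reward threshold <code>tau</code> simply follow the descriptions above.
</p>
<pre><code># Hypothetical outline of RISE-R1 training. The trainers are injected
# callables, so nothing here pretends to be the released implementation.

THINK_ANSWER_TEMPLATE = "&lt;think&gt;{cot}&lt;/think&gt;\n&lt;answer&gt;{annotation}&lt;/answer&gt;"

def build_sft_example(item):
    """Pack a verified CoT into the structured "think-answer" target."""
    target = THINK_ANSWER_TEMPLATE.format(cot=item["cot"],
                                          annotation=item["annotation"])
    return {"image": item["image"], "target": target}

def train_rise_r1(vlm, scored_cots, full_dataset, sft_fn, rft_fn, tau=0.75):
    # Inspire: SFT on the subset of CoTs that cleared the RISE-CoT reward.
    sft_set = [build_sft_example(x) for x in scored_cots
               if x["reward"] &gt;= tau]
    vlm = sft_fn(vlm, sft_set)

    # Strengthen: RFT on the full dataset, rewarding both format
    # compliance and annotation accuracy of the final output.
    vlm = rft_fn(vlm, full_dataset)
    return vlm</code></pre>
</div>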
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">Experiments & Results</h2>
<div class="columns is-centered">
<div class="column">
<h3 class="title is-4">Datasets</h3>
<div class="content has-text-justified">
<p>We evaluated RISE on four image annotation datasets of varying complexity:</p>
<ul>
<li><strong>Emotion6:</strong> Emotion classification with probability distributions</li>
<li><strong>LISA:</strong> Context-driven object detection</li>
<li><strong>ImageNet-Sub:</strong> Simple classification task</li>
<li><strong>COCO-Sub:</strong> Multi-target object detection</li>
</ul>
</div>
</div>
<div class="column">
<h3 class="title is-4">Key Results</h3>
<div class="content has-text-justified">
<p>RISE demonstrates superior performance across both complex and simple tasks:</p>
<ul>
<li>Outperforms SFT and Visual-RFT on Emotion6 and LISA</li>
<li>Achieves robust performance on ImageNet-Sub and COCO-Sub</li>
<li>Generates high-quality, interpretable Chains of Thought</li>
<li>Provides a self-supervised solution without manual CoT annotation</li>
</ul>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">Ablation Studies</h2>
<div class="content has-text-justified">
<p>Our ablation studies confirm the importance of key RISE components:</p>
<ul>
<li><strong>CoT Quality:</strong> RISE-CoT generates higher-quality CoTs than the Base-Model and GPT-4o baselines</li>
<li><strong>SFT Initialization:</strong> SFT on high-quality CoT subset is crucial for RFT success</li>
<li><strong>Reward Function:</strong> Full reward function (with leakage prevention and format constraints) achieves best performance</li>
<li><strong>Threshold Selection:</strong> τ = 0.75 best balances CoT quality against dataset size (see the sketch below)</li>
</ul>
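<p>
The threshold trade-off can be made concrete with a short, self-contained snippet. The reward values below are invented for illustration; only the τ values mirror the ablation.
</p>
<pre><code># Sweep the CoT-quality threshold: a higher tau keeps cleaner CoTs but
# shrinks the SFT subset (reward values are invented for illustration).

def select_sft_subset(scored_cots, tau=0.75):
    """Keep only CoTs whose closed-loop reward clears the threshold."""
    return [ex for ex in scored_cots if ex["reward"] &gt;= tau]

scored = [{"cot": "...", "reward": r}
          for r in (0.20, 0.55, 0.70, 0.80, 0.95)]

for tau in (0.50, 0.75, 0.90):
    kept = select_sft_subset(scored, tau)
    print(f"tau={tau:.2f}: kept {len(kept)} of {len(scored)} CoTs")</code></pre>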
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<h2 class="title is-3">Conclusion</h2>
<div class="content has-text-justified">
<p>
We introduced RISE, a novel two-stage framework that significantly enhances VLMs for complex image annotation tasks.
RISE autonomously generates high-quality CoTs by verifying their ability to reconstruct original annotations, then uses
these CoTs to train VLMs to produce accurate and interpretable "think-answer" outputs directly from images.
</p>
<p>
Through its verifiable, self-supervised CoT generation, RISE improves annotation accuracy and interpretability while
uniquely enabling implicit evaluation and refinement of dataset annotation quality. This framework effectively boosts
the reasoning capabilities of lower-capacity VLMs across various image annotation tasks, allowing them to perform comparably to larger models.
</p>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@article{hu2025rise,
  title={RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning},
  author={Hu, Suhang and Hu, Wei and Su, Yuhang and Zhang, Fan},
  journal={arXiv preprint arXiv:2508.13229},
  year={2025}
}</code></pre>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<p>
This website is licensed under a <a rel="license" target="_blank"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</footer>
</body>
</html>