<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning">
  <meta name="keywords" content="RISE, VLM, Vision-Language Models, Image Annotation, Chain of Thought, CoT">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</title>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <link rel="icon" href="./static/images/favicon.svg">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>

<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</h1>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="#" target="_blank">Suhang Hu</a><sup>*</sup>,</span>
            <span class="author-block">
              <a href="#" target="_blank">Wei Hu</a><sup>†</sup>,</span>
            <span class="author-block">
              <a href="#" target="_blank">Yuhang Su</a>,
            </span>
            <span class="author-block">
              <a href="#" target="_blank">Fan Zhang</a>
            </span>
          </div>

          <div class="is-size-5 publication-authors">
            <span class="author-block">Beijing University of Chemical Technology</span>
          </div>

          <div class="column has-text-centered">
            <div class="publication-links">
              <span class="link-block">
                <a href="static/images/RISE.pdf" target="_blank"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fas fa-file-pdf"></i>
                  </span>
                  <span>Paper</span>
                </a>
              </span>
              <span class="link-block">
                <a href="https://arxiv.org/abs/2508.13229" target="_blank"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="ai ai-arxiv"></i>
                  </span>
                  <span>arXiv</span>
                </a>
              </span>
              <span class="link-block">
                <a href="https://github.com/HSH55/RISE" target="_blank"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fab fa-github"></i>
                  </span>
                  <span>Code</span>
                </a>
              </span>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
            Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring the underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training.
          </p>
          <p>
            We introduce <strong>RISE</strong> (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the <strong>Reason</strong> stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct the original annotations without direct leakage. The <strong>Inspire</strong> and <strong>Strengthen</strong> stage (RISE-R1) leverages a high-quality CoT subset for supervised fine-tuning, followed by reinforcement fine-tuning, to produce interpretable reasoning and accurate annotations.
          </p>
          <p>
            Evaluated on both complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
          </p>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">RISE Framework</h2>

    <div class="columns is-centered">
      <div class="column is-full-width">
        <h3 class="title is-4">Two-Stage Approach</h3>

        <div class="content has-text-justified">
          <p>
            RISE operates in two stages to enhance VLM reasoning capabilities for image annotation tasks:
          </p>

          <h4 class="title is-5">1. RISE-CoT: Closed-Loop Reasoning Generation</h4>
          <p>
            This stage generates high-quality, visually grounded Chains of Thought (CoTs) for image-annotation pairs in a self-supervised manner (a schematic sketch follows Figure 1). The process involves:
          </p>
          <ul>
            <li><strong>Reasoning Generation:</strong> the VLM produces a CoT that justifies the annotation without leaking its specifics</li>
            <li><strong>Annotation Reconstruction:</strong> the VLM reconstructs the annotation from the generated CoT alone</li>
            <li><strong>Consistency Validation:</strong> a reward function scores CoT quality by reconstruction accuracy</li>
          </ul>

          <div class="content has-text-centered">
            <img src="./static/images/Rise-cot.png" alt="RISE-CoT framework diagram" style="width: 100%;">
            <p class="has-text-centered">Figure 1: RISE-CoT framework</p>
          </div>
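
          <p>
            To make the closed loop concrete, here is a minimal, illustrative sketch of the consistency reward, not the paper's exact implementation: <code>generate_cot</code> and <code>reconstruct</code> are hypothetical callables standing in for VLM inference, and exact-match scoring stands in for the task-specific reward (the full reward also adds format constraints).
          </p>
          <pre><code># Illustrative sketch only. The two callables are hypothetical stand-ins
# for VLM inference; exact match stands in for task-specific scoring.
from typing import Callable

def consistency_reward(
    image: bytes,
    annotation: str,
    generate_cot: Callable[[bytes, str], str],   # (image, annotation) -> CoT
    reconstruct: Callable[[bytes, str], str],    # (image, CoT) -> annotation
) -> float:
    """Score a CoT by whether the annotation can be re-derived from it."""
    cot = generate_cot(image, annotation)

    # Leakage prevention: a CoT that quotes the label verbatim earns no
    # reward, since reconstructing the annotation from it would be trivial.
    if annotation.lower() in cot.lower():
        return 0.0

    # Reconstruction accuracy: high reward only if the original annotation
    # is recovered from the CoT without having been leaked into it.
    reconstructed = reconstruct(image, cot)
    return 1.0 if reconstructed.strip().lower() == annotation.strip().lower() else 0.0</code></pre>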
|
|
|
|
|
<h4 class="title is-5">2. RISE-R1: Training VLM for Enhanced CoTs</h4> |
|
|
<p> |
|
|
This stage trains the VLM to produce structured "think-answer" outputs: |
|
|
</p> |
|
|
<ul> |
|
|
<li><strong>Inspire (SFT):</strong> Supervised fine-tuning on high-quality CoT subset</li> |
|
|
<li><strong>Strengthen (RFT):</strong> Reinforcement fine-tuning on full dataset to optimize task-specific outputs</li> |
|
|
</ul> |
|
|
</div> |
|
|
|
|
|
<div class="content has-text-centered"> |
|
|
<img src="./static/images/Rise-r1.png" alt="RISE Framework Diagram" style="width: 100%;"> |
|
|
<p class="has-text-centered">Figure 2: RISE-R1 framework</p> |
|
|
</div> |
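
        <div class="content has-text-justified">
          <p>
            The schedule below is a minimal sketch of this two-phase recipe under stated assumptions, not the authors' exact training code: <code>sft_step</code> and <code>rft_step</code> are hypothetical trainer hooks, and the 0.75 reward threshold is the value reported in the ablation study.
          </p>
          <pre><code># Illustrative sketch; sft_step / rft_step are hypothetical trainer hooks.
TAU = 0.75  # reward threshold from the ablation study

def rise_r1(dataset, rewards, sft_step, rft_step, sft_epochs=1, rft_epochs=1):
    # Inspire: supervised fine-tuning on the verified high-quality CoT
    # subset, teaching the "think-answer" output format by imitation.
    subset = [ex for ex, r in zip(dataset, rewards) if r >= TAU]
    for _ in range(sft_epochs):
        for example in subset:
            sft_step(example)

    # Strengthen: reinforcement fine-tuning on the full dataset to optimize
    # task-specific rewards beyond what imitation alone provides.
    for _ in range(rft_epochs):
        for example in dataset:
            rft_step(example)</code></pre>
        </div>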
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">Experiments &amp; Results</h2>

    <div class="columns is-centered">
      <div class="column">
        <h3 class="title is-4">Datasets</h3>
        <div class="content has-text-justified">
          <p>We evaluated RISE on four image annotation datasets of varying complexity:</p>
          <ul>
            <li><strong>Emotion6:</strong> emotion classification with probability distributions</li>
            <li><strong>LISA:</strong> context-driven object detection</li>
            <li><strong>ImageNet-Sub:</strong> simple classification</li>
            <li><strong>COCO-Sub:</strong> multi-target object detection</li>
          </ul>
        </div>
      </div>

      <div class="column">
        <h3 class="title is-4">Key Results</h3>
        <div class="content has-text-justified">
          <p>RISE demonstrates superior performance across both complex and simple tasks:</p>
          <ul>
            <li>Outperforms SFT and Visual-RFT on Emotion6 and LISA</li>
            <li>Achieves robust performance on ImageNet-Sub and COCO-Sub</li>
            <li>Generates high-quality, interpretable Chains of Thought</li>
            <li>Provides a self-supervised solution without manual CoT annotation</li>
          </ul>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">Ablation Studies</h2>

    <div class="content has-text-justified">
      <p>Our ablation studies confirm the importance of key RISE components:</p>
      <ul>
        <li><strong>CoT Quality:</strong> RISE-CoT generates higher-quality CoTs than Base-Model and GPT-4o</li>
        <li><strong>SFT Initialization:</strong> SFT on the high-quality CoT subset is crucial for RFT success</li>
        <li><strong>Reward Function:</strong> the full reward function (with leakage prevention and format constraints) achieves the best performance</li>
        <li><strong>Threshold Selection:</strong> τ = 0.75 optimally balances CoT quality against dataset size</li>
      </ul>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <h2 class="title is-3">Conclusion</h2>

    <div class="content has-text-justified">
      <p>
        We introduced RISE, a novel two-stage framework that significantly enhances VLMs for complex image annotation tasks.
        RISE autonomously generates high-quality CoTs by verifying their ability to reconstruct the original annotations, then uses
        these CoTs to train VLMs to produce accurate and interpretable "think-answer" outputs directly from images.
      </p>
      <p>
        Through its verifiable, self-supervised CoT generation, RISE improves annotation accuracy and interpretability while
        uniquely enabling implicit evaluation and refinement of dataset annotation quality. The framework effectively boosts
        the reasoning capabilities of lower-capacity VLMs across a range of image annotation tasks, allowing them to perform on par with larger models.
      </p>
    </div>
  </div>
</section>

<section class="section" id="BibTeX">
  <div class="container is-max-desktop content">
    <h2 class="title">BibTeX</h2>
    <pre><code>@article{hu2025rise,
  title={RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning},
  author={Hu, Suhang and Hu, Wei and Su, Yuhang and Zhang, Fan},
  journal={arXiv preprint arXiv:2508.13229},
  year={2025}
}</code></pre>
  </div>
</section>

<footer class="footer">
  <div class="container">
    <div class="content has-text-centered">
      <p>
        This website is licensed under a <a rel="license" target="_blank"
        href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
        Commons Attribution-ShareAlike 4.0 International License</a>.
      </p>
    </div>
  </div>
</footer>

</body>
</html>