mldlb committed on
Commit
7df7c06
·
verified ·
1 Parent(s): aaec13b

Upload 2 files

Files changed (1)
  1. index.html +135 -286
index.html CHANGED
@@ -3,10 +3,10 @@
3
  <head>
4
  <meta charset="utf-8">
5
  <meta name="description"
6
- content="Deformable Neural Radiance Fields creates free-viewpoint portraits (nerfies) from casually captured videos.">
7
- <meta name="keywords" content="Nerfies, D-NeRF, NeRF">
8
  <meta name="viewport" content="width=device-width, initial-scale=1">
9
- <title>Nerfies: Deformable Neural Radiance Fields</title>
10
 
11
  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
12
  rel="stylesheet">
@@ -33,39 +33,29 @@
33
  <div class="container is-max-desktop">
34
  <div class="columns is-centered">
35
  <div class="column has-text-centered">
36
- <h1 class="title is-1 publication-title">Nerfies: Deformable Neural Radiance Fields</h1>
37
  <div class="is-size-5 publication-authors">
38
  <span class="author-block">
39
- <a href="https://keunhong.com" target="_blank">Keunhong Park</a><sup>1</sup>,</span>
40
  <span class="author-block">
41
- <a href="https://utkarshsinha.com" target="_blank">Utkarsh Sinha</a><sup>2</sup>,</span>
42
  <span class="author-block">
43
- <a href="https://jonbarron.info" target="_blank">Jonathan T. Barron</a><sup>2</sup>,
44
  </span>
45
  <span class="author-block">
46
- <a href="http://sofienbouaziz.com" target="_blank">Sofien Bouaziz</a><sup>2</sup>,
47
- </span>
48
- <span class="author-block">
49
- <a href="https://www.danbgoldman.com" target="_blank">Dan B Goldman</a><sup>2</sup>,
50
- </span>
51
- <span class="author-block">
52
- <a href="https://homes.cs.washington.edu/~seitz/" target="_blank">Steven M. Seitz</a><sup>1,2</sup>,
53
- </span>
54
- <span class="author-block">
55
- <a href="http://www.ricardomartinbrualla.com" target="_blank">Ricardo Martin-Brualla</a><sup>2</sup>
56
  </span>
57
  </div>
58
 
59
  <div class="is-size-5 publication-authors">
60
- <span class="author-block"><sup>1</sup>University of Washington,</span>
61
- <span class="author-block"><sup>2</sup>Google Research</span>
62
  </div>
63
 
64
  <div class="column has-text-centered">
65
  <div class="publication-links">
66
  <!-- PDF Link. -->
67
  <span class="link-block">
68
- <a href="https://arxiv.org/pdf/2011.12948" target="_blank"
69
  class="external-link button is-normal is-rounded is-dark">
70
  <span class="icon">
71
  <i class="fas fa-file-pdf"></i>
@@ -74,7 +64,7 @@
74
  </a>
75
  </span>
76
  <span class="link-block">
77
- <a href="https://arxiv.org/abs/2011.12948" target="_blank"
78
  class="external-link button is-normal is-rounded is-dark">
79
  <span class="icon">
80
  <i class="ai ai-arxiv"></i>
@@ -82,19 +72,9 @@
82
  <span>arXiv</span>
83
  </a>
84
  </span>
85
- <!-- Video Link. -->
86
- <span class="link-block">
87
- <a href="https://www.youtube.com/watch?v=MrKrnHhk8IA" target="_blank"
88
- class="external-link button is-normal is-rounded is-dark">
89
- <span class="icon">
90
- <i class="fab fa-youtube"></i>
91
- </span>
92
- <span>Video</span>
93
- </a>
94
- </span>
95
  <!-- Code Link. -->
96
  <span class="link-block">
97
- <a href="https://github.com/google/nerfies" target="_blank"
98
  class="external-link button is-normal is-rounded is-dark">
99
  <span class="icon">
100
  <i class="fab fa-github"></i>
@@ -102,17 +82,7 @@
102
  <span>Code</span>
103
  </a>
104
  </span>
105
- <!-- Dataset Link. -->
106
- <span class="link-block">
107
- <a href="https://github.com/google/nerfies/releases/tag/0.1" target="_blank"
108
- class="external-link button is-normal is-rounded is-dark">
109
- <span class="icon">
110
- <i class="far fa-images"></i>
111
- </span>
112
- <span>Data</span>
113
- </a>
114
  </div>
115
-
116
  </div>
117
  </div>
118
  </div>
@@ -123,78 +93,14 @@
123
  <section class="hero teaser">
124
  <div class="container is-max-desktop">
125
  <div class="hero-body">
126
- <video id="teaser" autoplay muted loop playsinline height="100%">
127
- <source src="./static/videos/teaser.mp4"
128
- type="video/mp4">
129
- </video>
130
  <h2 class="subtitle has-text-centered">
131
- <span class="dnerf">Nerfies</span> turns selfie videos from your phone into
132
- free-viewpoint
133
- portraits.
134
  </h2>
135
  </div>
136
  </div>
137
  </section>
138
 
139
-
140
- <section class="hero is-light is-small">
141
- <div class="hero-body">
142
- <div class="container">
143
- <div id="results-carousel" class="carousel results-carousel">
144
- <div class="item item-steve">
145
- <video poster="" id="steve" autoplay controls muted loop playsinline height="100%">
146
- <source src="./static/videos/steve.mp4"
147
- type="video/mp4">
148
- </video>
149
- </div>
150
- <div class="item item-chair-tp">
151
- <video poster="" id="chair-tp" autoplay controls muted loop playsinline height="100%">
152
- <source src="./static/videos/chair-tp.mp4"
153
- type="video/mp4">
154
- </video>
155
- </div>
156
- <div class="item item-shiba">
157
- <video poster="" id="shiba" autoplay controls muted loop playsinline height="100%">
158
- <source src="./static/videos/shiba.mp4"
159
- type="video/mp4">
160
- </video>
161
- </div>
162
- <div class="item item-fullbody">
163
- <video poster="" id="fullbody" autoplay controls muted loop playsinline height="100%">
164
- <source src="./static/videos/fullbody.mp4"
165
- type="video/mp4">
166
- </video>
167
- </div>
168
- <div class="item item-blueshirt">
169
- <video poster="" id="blueshirt" autoplay controls muted loop playsinline height="100%">
170
- <source src="./static/videos/blueshirt.mp4"
171
- type="video/mp4">
172
- </video>
173
- </div>
174
- <div class="item item-mask">
175
- <video poster="" id="mask" autoplay controls muted loop playsinline height="100%">
176
- <source src="./static/videos/mask.mp4"
177
- type="video/mp4">
178
- </video>
179
- </div>
180
- <div class="item item-coffee">
181
- <video poster="" id="coffee" autoplay controls muted loop playsinline height="100%">
182
- <source src="./static/videos/coffee.mp4"
183
- type="video/mp4">
184
- </video>
185
- </div>
186
- <div class="item item-toby">
187
- <video poster="" id="toby" autoplay controls muted loop playsinline height="100%">
188
- <source src="./static/videos/toby2.mp4"
189
- type="video/mp4">
190
- </video>
191
- </div>
192
- </div>
193
- </div>
194
- </div>
195
- </section>
196
-
197
-
198
  <section class="section">
199
  <div class="container is-max-desktop">
200
  <!-- Abstract. -->
@@ -203,233 +109,176 @@
203
  <h2 class="title is-3">Abstract</h2>
204
  <div class="content has-text-justified">
205
  <p>
206
- We present the first method capable of photorealistically reconstructing a non-rigidly
207
- deforming scene using photos/videos captured casually from mobile phones.
208
  </p>
209
  <p>
210
- Our approach augments neural radiance fields
211
- (NeRF) by optimizing an
212
- additional continuous volumetric deformation field that warps each observed point into a
213
- canonical 5D NeRF.
214
- We observe that these NeRF-like deformation fields are prone to local minima, and
215
- propose a coarse-to-fine optimization method for coordinate-based models that allows for
216
- more robust optimization.
217
- By adapting principles from geometry processing and physical simulation to NeRF-like
218
- models, we propose an elastic regularization of the deformation field that further
219
- improves robustness.
220
  </p>
221
  <p>
222
- We show that <span class="dnerf">Nerfies</span> can turn casually captured selfie
223
- photos/videos into deformable NeRF
224
- models that allow for photorealistic renderings of the subject from arbitrary
225
- viewpoints, which we dub <i>"nerfies"</i>. We evaluate our method by collecting data
226
- using a
227
- rig with two mobile phones that take time-synchronized photos, yielding train/validation
228
- images of the same pose at different viewpoints. We show that our method faithfully
229
- reconstructs non-rigidly deforming scenes and reproduces unseen views with high
230
- fidelity.
231
  </p>
232
  </div>
233
  </div>
234
  </div>
235
  <!--/ Abstract. -->
236
-
237
- <!-- Paper video. -->
238
- <div class="columns is-centered has-text-centered">
239
- <div class="column is-four-fifths">
240
- <h2 class="title is-3">Video</h2>
241
- <div class="publication-video">
242
- <iframe src="https://www.youtube.com/embed/MrKrnHhk8IA?rel=0&amp;showinfo=0"
243
- frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
244
- </div>
245
- </div>
246
- </div>
247
- <!--/ Paper video. -->
248
  </div>
249
  </section>
250
 
251
-
252
  <section class="section">
253
  <div class="container is-max-desktop">
254
-
255
- <div class="columns is-centered">
256
-
257
- <!-- Visual Effects. -->
258
- <div class="column">
259
- <div class="content">
260
- <h2 class="title is-3">Visual Effects</h2>
261
- <p>
262
- Using <i>nerfies</i> you can create fun visual effects. This Dolly zoom effect
263
- would be impossible without nerfies since it would require going through a wall.
264
- </p>
265
- <video id="dollyzoom" autoplay controls muted loop playsinline height="100%">
266
- <source src="./static/videos/dollyzoom-stacked.mp4"
267
- type="video/mp4">
268
- </video>
269
- </div>
270
- </div>
271
- <!--/ Visual Effects. -->
272
-
273
- <!-- Matting. -->
274
- <div class="column">
275
- <h2 class="title is-3">Matting</h2>
276
- <div class="columns is-centered">
277
- <div class="column content">
278
- <p>
279
- As a byproduct of our method, we can also solve the matting problem by ignoring
280
- samples that fall outside of a bounding box during rendering.
281
- </p>
282
- <video id="matting-video" controls playsinline height="100%">
283
- <source src="./static/videos/matting.mp4"
284
- type="video/mp4">
285
- </video>
286
- </div>
287
-
288
- </div>
289
- </div>
290
- </div>
291
- <!--/ Matting. -->
292
-
293
- <!-- Animation. -->
294
  <div class="columns is-centered">
295
  <div class="column is-full-width">
296
- <h2 class="title is-3">Animation</h2>
297
-
298
- <!-- Interpolating. -->
299
- <h3 class="title is-4">Interpolating states</h3>
300
  <div class="content has-text-justified">
301
  <p>
302
- We can also animate the scene by interpolating the deformation latent codes of two input
303
- frames. Use the slider here to linearly interpolate between the left frame and the right
304
- frame.
305
  </p>
306
- </div>
307
- <div class="columns is-vcentered interpolation-panel">
308
- <div class="column is-3 has-text-centered">
309
- <img src="./static/images/interpolate_start.jpg"
310
- class="interpolation-image"
311
- alt="Interpolate start reference image."/>
312
- <p>Start Frame</p>
313
- </div>
314
- <div class="column interpolation-video-column">
315
- <div id="interpolation-image-wrapper">
316
- Loading...
317
- </div>
318
- <input class="slider is-fullwidth is-large is-info"
319
- id="interpolation-slider"
320
- step="1" min="0" max="100" value="0" type="range">
321
- </div>
322
- <div class="column is-3 has-text-centered">
323
- <img src="./static/images/interpolate_end.jpg"
324
- class="interpolation-image"
325
- alt="Interpolation end reference image."/>
326
- <p class="is-bold">End Frame</p>
327
- </div>
328
- </div>
329
- <br/>
330
- <!--/ Interpolating. -->
331
-
332
- <!-- Re-rendering. -->
333
- <h3 class="title is-4">Re-rendering the input video</h3>
334
- <div class="content has-text-justified">
335
  <p>
336
- Using <span class="dnerf">Nerfies</span>, you can re-render a video from a novel
337
- viewpoint such as a stabilized camera by playing back the training deformations.
338
  </p>
339
  </div>
 
340
  <div class="content has-text-centered">
341
- <video id="replay-video"
342
- controls
343
- muted
344
- preload
345
- playsinline
346
- width="75%">
347
- <source src="./static/videos/replay.mp4"
348
- type="video/mp4">
349
- </video>
350
  </div>
351
- <!--/ Re-rendering. -->
352
-
353
  </div>
354
  </div>
355
- <!--/ Animation. -->
356
-
357
 
358
- <!-- Concurrent Work. -->
 
 
 
359
  <div class="columns is-centered">
360
- <div class="column is-full-width">
361
- <h2 class="title is-3">Related Links</h2>
362
-
363
  <div class="content has-text-justified">
364
- <p>
365
- There's a lot of excellent work that was introduced around the same time as ours.
366
- </p>
367
- <p>
368
- <a href="https://arxiv.org/abs/2104.09125" target="_blank">Progressive Encoding for Neural Optimization</a> introduces an idea similar to our windowed position encoding for coarse-to-fine optimization.
369
- </p>
370
- <p>
371
- <a href="https://www.albertpumarola.com/research/D-NeRF/index.html" target="_blank">D-NeRF</a> and <a href="https://gvv.mpi-inf.mpg.de/projects/nonrigid_nerf/" target="_blank">NR-NeRF</a>
372
- both use deformation fields to model non-rigid scenes.
373
- </p>
374
- <p>
375
- Some works model videos with a NeRF by directly modulating the density, such as <a href="https://video-nerf.github.io/" target="_blank">Video-NeRF</a>, <a href="https://www.cs.cornell.edu/~zl548/NSFF/" target="_blank">NSFF</a>, and <a href="https://neural-3d-video.github.io/" target="_blank">DyNeRF</a>
376
- </p>
377
- <p>
378
- There are probably many more by the time you are reading this. Check out <a href="https://dellaert.github.io/NeRF/" target="_blank">Frank Dellart's survey on recent NeRF papers</a>, and <a href="https://github.com/yenchenlin/awesome-NeRF" target="_blank">Yen-Chen Lin's curated list of NeRF papers</a>.
379
- </p>
380
  </div>
381
  </div>
382
  </div>
383
- <!--/ Concurrent Work. -->
384
 
385
  </div>
386
  </section>
387
 
388
 
389
  <section class="section" id="BibTeX">
390
  <div class="container is-max-desktop content">
391
  <h2 class="title">BibTeX</h2>
392
- <pre><code>@article{park2021nerfies,
393
- author = {Park, Keunhong and Sinha, Utkarsh and Barron, Jonathan T. and Bouaziz, Sofien and Goldman, Dan B and Seitz, Steven M. and Martin-Brualla, Ricardo},
394
- title = {Nerfies: Deformable Neural Radiance Fields},
395
- journal = {ICCV},
396
- year = {2021},
397
  }</code></pre>
398
  </div>
399
  </section>
400
 
401
-
402
  <footer class="footer">
403
  <div class="container">
404
  <div class="content has-text-centered">
405
- <a class="icon-link" target="_blank"
406
- href="./static/videos/nerfies_paper.pdf">
407
- <i class="fas fa-file-pdf"></i>
408
- </a>
409
- <a class="icon-link" href="https://github.com/keunhong" target="_blank" class="external-link" disabled>
410
- <i class="fab fa-github"></i>
411
- </a>
412
- </div>
413
- <div class="columns is-centered">
414
- <div class="column is-8">
415
- <div class="content">
416
- <p>
417
- This website is licensed under a <a rel="license" target="_blank"
418
- href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
419
- Commons Attribution-ShareAlike 4.0 International License</a>.
420
- </p>
421
- <p>
422
- This means you are free to borrow the <a target="_blank"
423
- href="https://github.com/nerfies/nerfies.github.io">source code</a> of this website,
424
- we just ask that you link back to this page in the footer.
425
- Please remember to remove the analytics code included in the header of the website which
426
- you do not want on your website.
427
- </p>
428
- </div>
429
- </div>
430
  </div>
431
  </div>
432
  </footer>
433
 
434
  </body>
435
- </html>
 
3
  <head>
4
  <meta charset="utf-8">
5
  <meta name="description"
6
+ content="RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning">
7
+ <meta name="keywords" content="RISE, VLM, Vision-Language Models, Image Annotation, Chain of Thought, CoT">
8
  <meta name="viewport" content="width=device-width, initial-scale=1">
9
+ <title>RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</title>
10
 
11
  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
12
  rel="stylesheet">
 
33
  <div class="container is-max-desktop">
34
  <div class="columns is-centered">
35
  <div class="column has-text-centered">
36
+ <h1 class="title is-1 publication-title">RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning</h1>
37
  <div class="is-size-5 publication-authors">
38
  <span class="author-block">
39
+ <a href="#" target="_blank">Suhang Hu</a><sup>1,*</sup>,</span>
40
  <span class="author-block">
41
+ <a href="#" target="_blank">Wei Hu</a><sup>†</sup>,</span>
42
  <span class="author-block">
43
+ <a href="#" target="_blank">Yuhang Su</a>,
44
  </span>
45
  <span class="author-block">
46
+ <a href="#" target="_blank">Fan Zhang</a>
47
  </span>
48
  </div>
49
 
50
  <div class="is-size-5 publication-authors">
51
+ <span class="author-block"><sup>1</sup>Beijing University of Chemical Technology</span>
 
52
  </div>
53
 
54
  <div class="column has-text-centered">
55
  <div class="publication-links">
56
  <!-- PDF Link. -->
57
  <span class="link-block">
58
+ <a href="#" target="_blank"
59
  class="external-link button is-normal is-rounded is-dark">
60
  <span class="icon">
61
  <i class="fas fa-file-pdf"></i>
 
64
  </a>
65
  </span>
66
  <span class="link-block">
67
+ <a href="#" target="_blank"
68
  class="external-link button is-normal is-rounded is-dark">
69
  <span class="icon">
70
  <i class="ai ai-arxiv"></i>
 
72
  <span>arXiv</span>
73
  </a>
74
  </span>
75
  <!-- Code Link. -->
76
  <span class="link-block">
77
+ <a href="#" target="_blank"
78
  class="external-link button is-normal is-rounded is-dark">
79
  <span class="icon">
80
  <i class="fab fa-github"></i>
 
82
  <span>Code</span>
83
  </a>
84
  </span>
85
  </div>
 
86
  </div>
87
  </div>
88
  </div>
 
93
  <section class="hero teaser">
94
  <div class="container is-max-desktop">
95
  <div class="hero-body">
96
+ <img src="#" alt="RISE Framework Overview" style="width: 100%;">
 
 
 
97
  <h2 class="subtitle has-text-centered">
98
+ RISE: A two-stage framework for self-supervised reasoning in Vision-Language Models
 
 
99
  </h2>
100
  </div>
101
  </div>
102
  </section>
103
 
104
  <section class="section">
105
  <div class="container is-max-desktop">
106
  <!-- Abstract. -->
 
109
  <h2 class="title is-3">Abstract</h2>
110
  <div class="content has-text-justified">
111
  <p>
112
+ Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training.
 
113
  </p>
114
  <p>
115
+ We introduce <strong>RISE</strong> (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the <strong>Reason</strong> stage (RISE-CoT), a reinforcement-learning-driven "annotation-reasoning-annotation" closed loop generates visually grounded, logically consistent CoTs by verifying that they can reconstruct the original annotations without direct leakage. The <strong>Inspire</strong> and <strong>Strengthen</strong> stage (RISE-R1) first applies supervised fine-tuning on a high-quality CoT subset, then reinforcement fine-tuning, to produce interpretable reasoning and accurate annotations.
116
  </p>
117
  <p>
118
+ Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs.
119
  </p>
120
  </div>
121
  </div>
122
  </div>
123
  <!--/ Abstract. -->
124
  </div>
125
  </section>
126
 
 
127
  <section class="section">
128
  <div class="container is-max-desktop">
129
+ <h2 class="title is-3">RISE Framework</h2>
130
+
131
  <div class="columns is-centered">
132
  <div class="column is-full-width">
133
+ <h3 class="title is-4">Two-Stage Approach</h3>
134
+
 
 
135
  <div class="content has-text-justified">
136
  <p>
137
+ RISE operates through two stages to enhance VLM reasoning capabilities for image annotation tasks:
 
 
138
  </p>
139
+
140
+ <h4 class="title is-5">1. RISE-CoT: Closed-Loop Reasoning Generation</h4>
141
+ <p>
142
+ This stage generates high-quality, visually grounded Chains of Thought (CoTs) for image-annotation pairs in a self-supervised manner. The process involves:
143
+ </p>
144
+ <ul>
145
+ <li><strong>Reasoning Generation:</strong> The VLM produces a CoT that justifies the annotation without leaking its specifics</li>
146
+ <li><strong>Annotation Reconstruction:</strong> The VLM reconstructs the annotation from the generated CoT</li>
147
+ <li><strong>Consistency Validation:</strong> A reward function scores CoT quality by reconstruction accuracy</li>
148
+ </ul>
149
+
150
+ <h4 class="title is-5">2. RISE-R1: Training VLM for Enhanced CoTs</h4>
151
  <p>
152
+ This stage trains the VLM to produce structured "think-answer" outputs:
 
153
  </p>
154
+ <ul>
155
+ <li><strong>Inspire (SFT):</strong> Supervised fine-tuning on the high-quality CoT subset</li>
156
+ <li><strong>Strengthen (RFT):</strong> Reinforcement fine-tuning on the full dataset to optimize task-specific outputs</li>
157
+ </ul>
158
  </div>
159
+
160
  <div class="content has-text-centered">
161
+ <img src="#" alt="RISE Framework Diagram" style="width: 80%;">
162
+ <p class="has-text-centered">Figure 1: RISE two-stage framework</p>
 
 
 
 
 
 
 
163
  </div>
 
 
164
  </div>
165
  </div>
166
+ </div>
167
+ </section>
168
 
169
+ <section class="section">
170
+ <div class="container is-max-desktop">
171
+ <h2 class="title is-3">Experiments & Results</h2>
172
+
173
  <div class="columns is-centered">
174
+ <div class="column">
175
+ <h3 class="title is-4">Datasets</h3>
 
176
  <div class="content has-text-justified">
177
+ <p>We evaluated RISE on four image annotation datasets with varying complexity:</p>
178
+ <ul>
179
+ <li><strong>Emotion6:</strong> Emotion classification with probability distributions</li>
180
+ <li><strong>LISA:</strong> Context-driven object detection</li>
181
+ <li><strong>ImageNet-Sub:</strong> Simple classification task</li>
182
+ <li><strong>COCO-Sub:</strong> Multi-target object detection</li>
183
+ </ul>
184
  </div>
185
  </div>
186
+
187
+ <div class="column">
188
+ <h3 class="title is-4">Key Results</h3>
189
+ <div class="content has-text-justified">
190
+ <p>RISE demonstrates superior performance across both complex and simple tasks:</p>
191
+ <ul>
192
+ <li>Outperforms SFT and Visual-RFT on Emotion6 and LISA</li>
193
+ <li>Achieves robust performance on ImageNet-Sub and COCO-Sub</li>
194
+ <li>Generates high-quality, interpretable Chains of Thought</li>
195
+ <li>Provides a self-supervised solution that needs no manual CoT annotation</li>
196
+ </ul>
197
+ </div>
198
+ </div>
199
+ </div>
200
+
201
+ <div class="columns is-centered">
202
+ <div class="column content has-text-centered">
203
+ <img src="#" alt="Results Table" style="width: 90%;">
204
+ <p>Table 1: Performance comparison on complex tasks (Emotion6 and LISA)</p>
205
+ </div>
206
  </div>
207
+
208
+ <div class="columns is-centered">
209
+ <div class="column content has-text-centered">
210
+ <img src="#" alt="Qualitative Results" style="width: 90%;">
211
+ <p>Figure 2: Qualitative examples of RISE's "think-answer" outputs</p>
212
+ </div>
213
+ </div>
214
+ </div>
215
+ </section>
216
 
217
+ <section class="section">
218
+ <div class="container is-max-desktop">
219
+ <h2 class="title is-3">Ablation Studies</h2>
220
+
221
+ <div class="content has-text-justified">
222
+ <p>Our ablation studies confirm the importance of key RISE components:</p>
223
+ <ul>
224
+ <li><strong>CoT Quality:</strong> RISE-CoT generates higher-quality CoTs than Base-Model and GPT-4o</li>
225
+ <li><strong>SFT Initialization:</strong> SFT on the high-quality CoT subset is crucial for RFT success</li>
226
+ <li><strong>Reward Function:</strong> The full reward function (with leakage prevention and format constraints) achieves the best performance</li>
227
+ <li><strong>Threshold Selection:</strong> τ=0.75 best balances CoT quality and dataset size</li>
228
+ </ul>
229
+ </div>
230
+
231
+ <div class="columns is-centered">
232
+ <div class="column content has-text-centered">
233
+ <img src="#" alt="Ablation Results" style="width: 80%;">
234
+ <p>Table 2: Ablation study on CoT quality</p>
235
+ </div>
236
+ </div>
237
  </div>
238
  </section>
239
 
240
+ <section class="section">
241
+ <div class="container is-max-desktop">
242
+ <h2 class="title is-3">Conclusion</h2>
243
+
244
+ <div class="content has-text-justified">
245
+ <p>
246
+ We introduced RISE, a novel two-stage framework that significantly enhances VLMs for complex image annotation tasks.
247
+ RISE autonomously generates high-quality CoTs by verifying their ability to reconstruct original annotations, then uses
248
+ these CoTs to train VLMs to produce accurate and interpretable "think-answer" outputs directly from images.
249
+ </p>
250
+ <p>
251
+ Through its verifiable, self-supervised CoT generation, RISE improves annotation accuracy and interpretability while
252
+ uniquely enabling implicit evaluation and refinement of dataset annotation quality. This framework effectively boosts
253
+ the reasoning capabilities of lower-capacity VLMs across various image annotation tasks, allowing them to perform comparably to larger models.
254
+ </p>
255
+ </div>
256
+ </div>
257
+ </section>
258
 
259
  <section class="section" id="BibTeX">
260
  <div class="container is-max-desktop content">
261
  <h2 class="title">BibTeX</h2>
262
+ <pre><code>@article{hu2024rise,
263
+ title={RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning},
264
+ author={Hu, Suhang and Hu, Wei and Su, Yuhang and Zhang, Fan},
265
+ journal={arXiv preprint},
266
+ year={2024}
267
  }</code></pre>
268
  </div>
269
  </section>
270
 
 
271
  <footer class="footer">
272
  <div class="container">
273
  <div class="content has-text-centered">
274
+ <p>
275
+ This website is licensed under a <a rel="license" target="_blank"
276
+ href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
277
+ Commons Attribution-ShareAlike 4.0 International License</a>.
278
+ </p>
279
  </div>
280
  </div>
281
  </footer>
282
 
283
  </body>
284
+ </html>
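
The added page describes the RISE-CoT loop only in prose: generate a CoT, reconstruct the annotation from it, score the CoT by reconstruction accuracy with leakage prevention, and keep CoTs above a threshold (τ=0.75) for the SFT subset. The following is a minimal Python sketch of that scoring and filtering step. Everything here is an assumption for illustration: the `CotSample` fields, the substring-based leakage check, and the exact-match reward are hypothetical stand-ins, since the page does not give the actual reward formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CotSample:
    """One image-annotation pair with a generated chain of thought (hypothetical schema)."""
    image_id: str
    annotation: str      # original dataset annotation
    cot: str             # generated chain of thought
    reconstructed: str   # annotation the model reconstructed from the CoT


def leaks_annotation(cot: str, annotation: str) -> bool:
    """Assumed leakage check: the CoT must not state the annotation verbatim."""
    return annotation.lower() in cot.lower()


def cot_reward(sample: CotSample) -> float:
    """Toy reward: 0.0 on leakage, otherwise exact-match reconstruction accuracy."""
    if leaks_annotation(sample.cot, sample.annotation):
        return 0.0
    return 1.0 if sample.reconstructed == sample.annotation else 0.0


def select_high_quality(samples: List[CotSample], tau: float = 0.75) -> List[CotSample]:
    """Keep only samples whose reward reaches the threshold tau."""
    return [s for s in samples if cot_reward(s) >= tau]
```

With a continuous reward (for example, a similarity between predicted and annotated emotion distributions), the same `select_high_quality` filter would apply unchanged; only `cot_reward` would need to be swapped.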