W1 & Q1: Clarification on Foreground Extraction and Matting Method

Thank you so much for pointing this out. We currently adopt InSPyReNet [1], a strong open-source matting model chosen among the methods we tested for its robustness and edge preservation across diverse in-the-wild scenarios.

As described in Appendix B.5, our foreground extraction pipeline begins with GPT-4V filtering to select videos from Pexels and Mixkit that exhibit clear foreground-background separation. We then apply InSPyReNet to obtain initial foreground masks, followed by manual filtering as a double-check step to ensure quality and to discard cases with low matting confidence or ambiguous boundaries.
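
To make the matting step concrete, below is a minimal per-frame sketch using the open-source `transparent-background` package that ships with InSPyReNet. The package name and `Remover.process` interface come from the public InSPyReNet tooling but should be treated as assumptions here; the GPT-4V pre-filtering and manual double-check stages are omitted.

```python
# Minimal sketch: per-frame foreground mask extraction with InSPyReNet.
# Assumes `pip install transparent-background` (the official InSPyReNet tool);
# the exact settings used in our pipeline may differ.
import os
from PIL import Image
from transparent_background import Remover

remover = Remover()  # loads a pretrained InSPyReNet checkpoint

def extract_masks(frame_dir: str, out_dir: str) -> None:
    """Save a soft foreground (saliency) map for every extracted video frame."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(frame_dir)):
        frame = Image.open(os.path.join(frame_dir, name)).convert("RGB")
        mask = remover.process(frame, type="map")  # "map" -> saliency map only
        mask.save(os.path.join(out_dir, name))
```

Masks produced this way are then passed through the manual filtering step described above to discard low-confidence or ambiguous cases.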

We will add a brief description and citation of the matting tool and clarify this process more explicitly in the final version, along with more qualitative examples.

[1] Kim et al., "Revisiting Image Pyramid Structure for High-Resolution Salient Object Detection," ACCV 2022.

---

W2 & Q2: Supporting Literal Lighting Guidance

We thank the reviewer for the valuable comment. Our model offers several control configurations involving text-guided relighting, including:

- Text-only + foreground video: To isolate lighting from scene context, we inpaint a gray background (see Fig. 4), enabling literal relighting guided solely by text prompts (a toy sketch of this gray-background preprocessing follows the list).

- Text + original video: The input background is retained, allowing users to preserve scene semantics while modifying illumination (see Fig. 12).

- Text + foreground + background video: The background video is encoded separately and combined via cross-attention (Fig. 8), enabling more expressive lighting consistent with both the prompt and the scene context.
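
As a toy illustration of the gray-background input used in the text-only configuration, the sketch below composites a matted foreground frame over a neutral gray canvas. This direct compositing is only a simplified stand-in for the inpainting step described above; the array shapes and the 0.5 gray level are illustrative assumptions.

```python
# Toy sketch: place a matted foreground over a neutral gray canvas so that
# text prompts alone drive the relighting, without background scene cues.
# The actual pipeline uses inpainting; this composite is an approximation.
import numpy as np

def gray_background_composite(frame: np.ndarray, alpha: np.ndarray,
                              gray_level: float = 0.5) -> np.ndarray:
    """frame: (H, W, 3) float RGB in [0, 1]; alpha: (H, W) soft foreground mask."""
    gray = np.full_like(frame, gray_level)   # neutral, context-free background
    alpha = alpha[..., None]                 # broadcast mask over RGB channels
    return alpha * frame + (1.0 - alpha) * gray
```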

While many prompts naturally contain both lighting and scene cues (e.g., "sunlight filtering through trees"), we also show text-guided relighting that leaves the scene context unchanged, as in Fig. 8 ("add purple light," with the background preserved) and Fig. 1 ("natural lighting," applied without scene-specific context). We emphasize that literal lighting control via text is supported, and we will add more such examples to further illustrate the model's ability to decouple text-based relighting from scene content.

---
<!-- future: fine-grained control -->

W3 & Q3: Model Architecture and Extension

Thank you for the thoughtful suggestion. Our framework is built on IC-Light's UNet backbone to fully leverage its strong relighting priors, which significantly boosts quality and training efficiency.

While we currently adopt UNet inflation with temporal layers, future exploration of more advanced architectures such as DiT is possible, as our dataset and conditioning scheme remain compatible. However, such a migration would require retraining the relighting capability from scratch, since DiT lacks rich illumination priors, as well as adapting temporal modeling and condition injection to the new architecture. This comes with high computational cost and uncertain benefits.
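
For readers unfamiliar with UNet inflation, the generic sketch below shows the idea: a temporal attention layer is inserted after a pretrained spatial block and attends only along the frame axis, leaving the image priors untouched. The module and its names are illustrative and are not the exact RelightVid implementation.

```python
# Generic sketch of inflating a 2D UNet with a temporal layer: spatial weights
# stay pretrained, and the new layer attends only across frames.
# Illustrative only; not the exact RelightVid code.
import torch
import torch.nn as nn
from einops import rearrange

class TemporalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)   # zero-init so training starts
        nn.init.zeros_(self.proj.bias)     # from the pretrained image model

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, tokens, dim) features from a spatial UNet block
        x = rearrange(x, "(b f) n d -> (b n) f d", f=num_frames)
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        x = x + self.proj(h)               # residual keeps spatial priors intact
        return rearrange(x, "(b n) f d -> (b f) n d", f=num_frames)
```

In this kind of inflation the temporal modules are typically the only newly trained parameters, which is what lets the image-level relighting priors carry over to video.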

We therefore focus RelightVid on making the most of strong pre-trained priors, combined with temporal consistency and multi-modal control. Exploring other backbones remains a promising direction for future work.