---
license: apache-2.0
tags:
- video-generation
- game-rendering
- game-editing
- diffusion
- g-buffer
- relighting
- text-to-video
- wan2.1
pipeline_tag: text-to-video
base_model: Wan-AI/Wan2.1-T2V-1.3B
datasets:
- custom
library_name: diffusers
---

# Game Editing

**Game Editing** is a fine-tuned video diffusion model for controllable game video synthesis. It enables users to manipulate lighting and environmental effects in game footage via text prompts, conditioned on G-buffer inputs.

## Model Details

| Attribute | Detail |
|-----------|--------|
| **Base Model** | [Wan 2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) |
| **Parameters** | 1.42B (BF16) |
| **Resolution** | 832 × 480 (480p) |
| **Frame Rate** | 16 FPS |
| **Clip Length** | 81 frames |
| **Format** | SafeTensors |

## Inputs

The model takes the following inputs:

- **G-buffers** as conditional inputs, providing dense geometric and material priors:
  - **Basecolor** (albedo)
  - **Normal** (surface normals)
  - **Depth**
  - **Roughness**
  - **Metallic**
- **Text prompt** describing the desired lighting and environmental effects

The G-buffers encode the scene's geometry and materials, while the text prompt controls lighting conditions, atmospheric effects, and overall visual style. This decoupled design allows users to edit the visual appearance of game footage without altering the underlying scene structure.

## Training

### Architecture

We adapt [Wan 2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) by incorporating G-buffers (dense geometric and material priors) as conditional inputs. The model is fully fine-tuned following the original training configuration of the base model.

### Data

The model is trained on video clips from the [**Black Myth: Wukong** dataset](https://github.com/ShandaAI/AlayaRenderer). Descriptive captions for each clip are generated using [Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct).
Since G-buffers already provide dense geometric and material information, the captions focus exclusively on **lighting and environmental effects**, enabling fine-grained text-based control over these attributes during inference.

### Procedure

- **Full fine-tuning** on the [Black Myth: Wukong dataset](https://github.com/ShandaAI/AlayaRenderer)
- Spatial resolution: **832 × 480** (480p)
- Frame rate: **16 FPS**
- Clip length: **81 frames**

## Evaluation & Generalization

In the absence of directly comparable methods, we establish a baseline by adapting DiffusionRenderer's forward renderer with DiffusionLight-extracted environment maps as lighting conditions.

- A held-out subset of Black Myth: Wukong is used for testing.
- **Cross-dataset evaluation** on **Cyberpunk 2077** demonstrates strong generalization to unseen game environments, maintaining high-fidelity and controllable video synthesis.

## Intended Use

- **Game video editing**: Manipulate lighting and environmental effects in game footage through text descriptions.
- **Controllable video synthesis**: Generate stylized game video conditioned on G-buffers and text prompts.

## Citation

If you find this model useful, please consider citing our work.
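
## Appendix: Input Shape Sketch

The clip specification (832 × 480, 16 FPS, 81 frames) and the five G-buffer inputs can be summarized in a small sanity-check script. This is only an illustrative sketch, not the model's loading code: the per-buffer channel counts, the `(frames, height, width, channels)` layout, and the helper names `clip_duration_seconds`, `expected_shape`, and `validate_gbuffers` are all assumptions that this card does not specify.

```python
# Values taken from the Model Details table of this card.
WIDTH, HEIGHT = 832, 480
FPS = 16
NUM_FRAMES = 81

# ASSUMPTION: channel counts per G-buffer are not stated in this card;
# these are common conventions (RGB albedo, XYZ normals, scalar maps).
GBUFFER_CHANNELS = {
    "basecolor": 3,
    "normal": 3,
    "depth": 1,
    "roughness": 1,
    "metallic": 1,
}


def clip_duration_seconds(num_frames=NUM_FRAMES, fps=FPS):
    """Length of one clip in seconds at the given frame rate."""
    return num_frames / fps


def expected_shape(name):
    """Assumed (frames, height, width, channels) layout for one G-buffer."""
    return (NUM_FRAMES, HEIGHT, WIDTH, GBUFFER_CHANNELS[name])


def validate_gbuffers(shapes):
    """Compare provided shapes against the assumed layout.

    `shapes` maps buffer name -> shape tuple. Returns a list of problems;
    an empty list means every required buffer is present and matches.
    """
    problems = []
    for name in GBUFFER_CHANNELS:
        if name not in shapes:
            problems.append(f"missing G-buffer: {name}")
        elif tuple(shapes[name]) != expected_shape(name):
            problems.append(
                f"{name}: got {tuple(shapes[name])}, expected {expected_shape(name)}"
            )
    return problems


if __name__ == "__main__":
    print(f"Clip duration: {clip_duration_seconds()} s")
    shapes = {name: expected_shape(name) for name in GBUFFER_CHANNELS}
    print("Problems:", validate_gbuffers(shapes))
```

At 16 FPS, an 81-frame clip lasts 81 / 16 = 5.0625 seconds. A check like this can catch a missing or mis-sized buffer before any inference is attempted, but the actual tensor layout expected by the released weights should be confirmed against the official inference code.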