---
license: apache-2.0
tags:
- video-generation
- game-rendering
- game-editing
- diffusion
- g-buffer
- relighting
- text-to-video
- wan2.1
pipeline_tag: text-to-video
base_model: Wan-AI/Wan2.1-T2V-1.3B
datasets:
- custom
library_name: diffusers
---
# Game Editing
**Game Editing** is a fine-tuned video diffusion model for controllable game video synthesis. It enables users to manipulate lighting and environmental effects in game footage via text prompts, conditioned on G-buffer inputs.
## Model Details
| Attribute | Detail |
|-----------|--------|
| **Base Model** | [Wan 2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) |
| **Parameters** | 1.42B (BF16) |
| **Resolution** | 832 × 480 (480p) |
| **Frame Rate** | 16 FPS |
| **Clip Length** | 81 frames |
| **Format** | SafeTensors |
## Inputs
The model takes the following inputs:
- **G-buffers** as conditional inputs, providing dense geometric and material priors:
- **Basecolor** (albedo)
- **Normal** (surface normals)
- **Depth**
- **Roughness**
- **Metallic**
- **Text prompt** describing the desired lighting and environmental effects
The G-buffers encode the scene's geometry and materials, while the text prompt controls lighting conditions, atmospheric effects, and overall visual style. This decoupled design allows users to edit the visual appearance of game footage without altering the underlying scene structure.
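To make the conditioning input concrete, here is a minimal sketch of packing the five G-buffer maps listed above into a single per-pixel conditioning tensor. The channel ordering and the use of random placeholder maps are assumptions for illustration; the checkpoint's actual channel layout is not specified in this card.

```python
import numpy as np

# Illustrative G-buffer layout (names follow the list above; the channel
# ordering is an assumption, not taken from the model checkpoint):
#   basecolor: 3 channels, normal: 3, depth: 1, roughness: 1, metallic: 1
H, W = 480, 832  # the model's native 480p resolution

basecolor = np.random.rand(H, W, 3).astype(np.float32)  # albedo in [0, 1]
normal    = np.random.rand(H, W, 3).astype(np.float32)  # unit normals remapped to [0, 1]
depth     = np.random.rand(H, W, 1).astype(np.float32)  # normalized scene depth
roughness = np.random.rand(H, W, 1).astype(np.float32)
metallic  = np.random.rand(H, W, 1).astype(np.float32)

def pack_gbuffer(*maps: np.ndarray) -> np.ndarray:
    """Concatenate per-pixel G-buffer maps along the channel axis."""
    return np.concatenate(maps, axis=-1)

gbuffer = pack_gbuffer(basecolor, normal, depth, roughness, metallic)
print(gbuffer.shape)  # (480, 832, 9)
```

A per-frame tensor like this, stacked over time, carries the scene structure, so the text prompt only needs to describe lighting and atmosphere.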
## Training
### Architecture
We adapt [Wan 2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) by incorporating G-buffers (dense geometric and material priors) as conditional inputs. The model is fully fine-tuned following the original training configuration of the base model.
### Data
The model is trained on video clips from the [**Black Myth: Wukong** dataset](https://github.com/ShandaAI/AlayaRenderer). Descriptive captions for each clip are generated using [Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). Since G-buffers already provide dense geometric and material information, the captions focus exclusively on **lighting and environmental effects**, enabling fine-grained text-based control over these attributes during inference.
### Procedure
- **Full fine-tuning** on the [Black Myth: Wukong dataset](https://github.com/ShandaAI/AlayaRenderer)
- Spatial resolution: **832 × 480** (480p)
- Frame rate: **16 FPS**
- Clip length: **81 frames**
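The clip geometry above implies the following per-clip numbers. The latent-grid calculation assumes the base model's causal VAE with 8× spatial and 4× temporal compression (carried over from Wan 2.1; this card does not state the compression factors explicitly).

```python
# Training clip geometry from the list above.
WIDTH, HEIGHT = 832, 480
FPS, CLIP_FRAMES = 16, 81

clip_seconds = CLIP_FRAMES / FPS             # ~5.06 s of footage per clip
# Assumed Wan 2.1 VAE compression: 8x spatial, 4x temporal (causal, so the
# first frame is kept uncompressed in time).
latent_w, latent_h = WIDTH // 8, HEIGHT // 8
latent_t = (CLIP_FRAMES - 1) // 4 + 1

print(f"{clip_seconds:.2f}s clips -> latent grid {latent_t} x {latent_h} x {latent_w}")
```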
## Evaluation & Generalization
In the absence of directly comparable methods, we establish a baseline by adapting DiffusionRenderer's forward renderer with DiffusionLight-extracted environment maps as lighting conditions.
- A held-out subset of Black Myth: Wukong is used for testing.
- **Cross-dataset evaluation** on **Cyberpunk 2077** demonstrates strong generalization to unseen game environments, maintaining high-fidelity and controllable video synthesis.
## Intended Use
- **Game video editing**: Manipulate lighting and environmental effects in game footage through text descriptions.
- **Controllable video synthesis**: Generate stylized game video conditioned on G-buffers and text prompts.
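Because the training captions describe only lighting and environmental effects, edit prompts at inference time should stay in that vocabulary. Below is a hypothetical helper for composing such prompts; the function name and phrasing style are illustrative, not taken from the training captions.

```python
# Hypothetical prompt builder: combines short attribute phrases into one
# lighting/environment edit prompt, mirroring the card's note that the text
# input controls lighting, atmosphere, and visual style only.
def build_edit_prompt(lighting: str, atmosphere: str = "", style: str = "") -> str:
    """Compose an edit prompt from short lighting/environment phrases."""
    parts = [lighting] + [p for p in (atmosphere, style) if p]
    return ", ".join(parts)

prompt = build_edit_prompt(
    lighting="warm golden-hour sunlight from the left",
    atmosphere="light ground fog",
    style="cinematic color grading",
)
print(prompt)
```

Keeping prompts scoped to lighting and atmosphere, rather than scene content, matches the decoupled design: geometry and materials come from the G-buffers.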
## Citation
If you find this model useful, please consider citing our work.