---
license: apache-2.0
tags:
- video-generation
- game-rendering
- game-editing
- diffusion
- g-buffer
- relighting
- text-to-video
- wan2.1
pipeline_tag: text-to-video
base_model: Wan-AI/Wan2.1-T2V-1.3B
datasets:
- custom
library_name: diffusers
---
# Game Editing
**Game Editing** is a fine-tuned video diffusion model for controllable game video synthesis. It enables users to manipulate lighting and environmental effects in game footage via text prompts, conditioned on G-buffer inputs.
## Model Details
| Attribute | Detail |
|-----------|--------|
| **Base Model** | [Wan 2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) |
| **Parameters** | 1.42B (BF16) |
| **Resolution** | 832 × 480 (480p) |
| **Frame Rate** | 16 FPS |
| **Clip Length** | 81 frames |
| **Format** | SafeTensors |
## Inputs
The model takes the following inputs:
- **G-buffers** as conditional inputs, providing dense geometric and material priors:
- **Basecolor** (albedo)
- **Normal** (surface normals)
- **Depth**
- **Roughness**
- **Metallic**
- **Text prompt** describing the desired lighting and environmental effects
The G-buffers encode the scene's geometry and materials, while the text prompt controls lighting conditions, atmospheric effects, and overall visual style. This decoupled design allows users to edit the visual appearance of game footage without altering the underlying scene structure.
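To make the conditioning input concrete, here is a minimal sketch of packing the five G-buffer maps listed above into a single per-pixel conditioning tensor. The channel ordering and the use of random placeholder maps are assumptions for illustration; the checkpoint's actual channel layout is not specified in this card.

```python
import numpy as np

# Illustrative G-buffer layout (names follow the list above; the channel
# ordering is an assumption, not taken from the model checkpoint):
#   basecolor: 3 channels, normal: 3, depth: 1, roughness: 1, metallic: 1
H, W = 480, 832  # the model's native 480p resolution

basecolor = np.random.rand(H, W, 3).astype(np.float32)  # albedo in [0, 1]
normal    = np.random.rand(H, W, 3).astype(np.float32)  # unit normals remapped to [0, 1]
depth     = np.random.rand(H, W, 1).astype(np.float32)  # normalized scene depth
roughness = np.random.rand(H, W, 1).astype(np.float32)
metallic  = np.random.rand(H, W, 1).astype(np.float32)

def pack_gbuffer(*maps: np.ndarray) -> np.ndarray:
    """Concatenate per-pixel G-buffer maps along the channel axis."""
    return np.concatenate(maps, axis=-1)

gbuffer = pack_gbuffer(basecolor, normal, depth, roughness, metallic)
print(gbuffer.shape)  # (480, 832, 9)
```

A per-frame tensor like this, stacked over time, carries the scene structure, so the text prompt only needs to describe lighting and atmosphere.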
## Training
### Architecture
We adapt [Wan 2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) by incorporating G-buffers (dense geometric and material priors) as conditional inputs. The model is fully fine-tuned following the original training configuration of the base model.
### Data
The model is trained on video clips from the [**Black Myth: Wukong** dataset](https://github.com/ShandaAI/AlayaRenderer). Descriptive captions for each clip are generated using [Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct). Since G-buffers already provide dense geometric and material information, the captions focus exclusively on **lighting and environmental effects**, enabling fine-grained text-based control over these attributes during inference.
### Procedure
- **Full fine-tuning** on the [Black Myth: Wukong dataset](https://github.com/ShandaAI/AlayaRenderer)
- Spatial resolution: **832 × 480** (480p)
- Frame rate: **16 FPS**
- Clip length: **81 frames**
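The clip geometry above implies the following per-clip numbers. The latent-grid calculation assumes the base model's causal VAE with 8× spatial and 4× temporal compression (carried over from Wan 2.1; this card does not state the compression factors explicitly).

```python
# Training clip geometry from the list above.
WIDTH, HEIGHT = 832, 480
FPS, CLIP_FRAMES = 16, 81

clip_seconds = CLIP_FRAMES / FPS             # ~5.06 s of footage per clip
# Assumed Wan 2.1 VAE compression: 8x spatial, 4x temporal (causal, so the
# first frame is kept uncompressed in time).
latent_w, latent_h = WIDTH // 8, HEIGHT // 8
latent_t = (CLIP_FRAMES - 1) // 4 + 1

print(f"{clip_seconds:.2f}s clips -> latent grid {latent_t} x {latent_h} x {latent_w}")
```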
## Evaluation & Generalization
In the absence of directly comparable methods, we establish a baseline by adapting DiffusionRenderer's forward renderer with DiffusionLight-extracted environment maps as lighting conditions.
- A held-out subset of Black Myth: Wukong is used for testing.
- **Cross-dataset evaluation** on **Cyberpunk 2077** demonstrates strong generalization to unseen game environments, maintaining high-fidelity and controllable video synthesis.
## Intended Use
- **Game video editing**: Manipulate lighting and environmental effects in game footage through text descriptions.
- **Controllable video synthesis**: Generate stylized game video conditioned on G-buffers and text prompts.
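Because the training captions describe only lighting and environmental effects, edit prompts at inference time should stay in that vocabulary. Below is a hypothetical helper for composing such prompts; the function name and phrasing style are illustrative, not taken from the training captions.

```python
# Hypothetical prompt builder: combines short attribute phrases into one
# lighting/environment edit prompt, mirroring the card's note that the text
# input controls lighting, atmosphere, and visual style only.
def build_edit_prompt(lighting: str, atmosphere: str = "", style: str = "") -> str:
    """Compose an edit prompt from short lighting/environment phrases."""
    parts = [lighting] + [p for p in (atmosphere, style) if p]
    return ", ".join(parts)

prompt = build_edit_prompt(
    lighting="warm golden-hour sunlight from the left",
    atmosphere="light ground fog",
    style="cinematic color grading",
)
print(prompt)
```

Keeping prompts scoped to lighting and atmosphere, rather than scene content, matches the decoupled design: geometry and materials come from the G-buffers.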
## Citation
If you find this model useful, please consider citing our work.