SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models


SCOPE is an interactive world model for FPS games with 10-DoF action control, trained on 69K clips across 7 games.


Highlights

  • Hybrid Action Space: Jointly processes continuous (4D dual-joystick) and discrete (6 binary buttons) control signals within a unified framework. SCOPE is the first FPS world model to do so.
  • Dense Per-Frame Conditioning: Resolves overlapping actions at every frame, enabling simultaneous multi-action composition (e.g., moving + aiming + firing) that reflects real gameplay complexity.
  • Cross-Game Generalization: Trained on 7 diverse FPS titles, a single model generalizes zero-shot to entirely unseen game environments without fine-tuning.
  • In-Scope / Out-of-Scope Decoupling: Spatially selective conditioning separates localized in-scope effects (weapon recoil, HUD) from stable out-of-scope world generation, without any segmentation labels.

Model Overview

SCOPE is an interactive world model for first-person shooter (FPS) games. Built on Wan2.2-TI2V-5B, SCOPE inserts a conditioning module into each transformer block that reshapes features into per-pixel temporal sequences. Each spatial position computes its action response from local visual content, naturally separating in-scope effects (e.g., weapon firing, reloading) from out-of-scope world generation (e.g., stable surroundings), without any segmentation labels.

Trained on CrossFPS (69K clips, 7 games, 10-DoF), SCOPE learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes.

Architecture

| Component | Details |
| --- | --- |
| Base Model | Wan2.2-TI2V-5B (DiT, 30 transformer layers) |
| Action Module | Per-block conditioning with per-pixel temporal sequences |
| Text Encoder | UMT5-XXL (4096-dim hidden) |
| VAE | Wan2.2 Video VAE (4× temporal compression, 8× spatial compression) |
| Total Parameters | ~5B (1575 tensors, of which 750 are action-related) |
| Precision | BFloat16 |

ActionModule Design

Each of the 30 DiT blocks contains an ActionModule with two conditioning paths:

  • Mouse/Joystick Path: Sliding-window temporal features → MLP fusion → pixel-wise temporal self-attention with RoPE
  • Keyboard/Button Path: Button embedding → temporal windowing → cross-attention (video queries, keyboard keys/values)

Both output projections are zero-initialized for stable residual training on top of frozen pretrained weights.
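
To illustrate why zero-initialized output projections make residual training stable, here is a minimal sketch in plain Python. It is not the SCOPE ActionModule: the function names, the ReLU branch, and the tiny dimensions are all hypothetical. It only shows that a residual branch whose final projection is all zeros leaves the pretrained features untouched at the start of training.

```python
import random


def linear(weights, x):
    """Multiply a (rows x cols) weight matrix by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def action_residual(hidden, action_feat, w_in, w_out):
    """Toy residual branch: hidden + W_out @ relu(W_in @ action_feat).

    Hypothetical stand-in for an action-conditioning path; the real
    module uses attention, which is omitted here.
    """
    inner = [max(0.0, h) for h in linear(w_in, action_feat)]
    update = linear(w_out, inner)
    return [h + u for h, u in zip(hidden, update)]


random.seed(0)
dim, act_dim, inner_dim = 4, 10, 8  # illustrative sizes only

hidden = [random.uniform(-1, 1) for _ in range(dim)]       # pretrained features
action = [random.uniform(-1, 1) for _ in range(act_dim)]   # one 10-DoF action
w_in = [[random.gauss(0, 0.1) for _ in range(act_dim)] for _ in range(inner_dim)]

# Zero-initialized output projection: the branch contributes nothing yet.
w_out_zero = [[0.0] * inner_dim for _ in range(dim)]

out = action_residual(hidden, action, w_in, w_out_zero)
assert out == hidden  # the block starts as an identity on the pretrained features
```

As training updates the output projection away from zero, the action branch gradually starts to influence the features, rather than perturbing the pretrained model from the first step.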

Generation Specs

| Property | Value |
| --- | --- |
| Resolution | 480 × 832 |
| Frame Count | 81 frames |
| Frame Rate | 20 FPS |
| Duration | ~4 seconds |
| Inference Steps | 30 (default) |
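
The duration follows directly from the frame count and frame rate. The short sketch below works through that arithmetic. It also estimates the number of latent frames under the 4× temporal compression listed in the Architecture section, assuming the VAE encodes the first frame on its own and every following group of 4 frames into one latent frame. That grouping rule is an assumption, not something this model card states, and `latent_frame_count` is a hypothetical helper.

```python
NUM_FRAMES = 81            # from the Generation Specs
FPS = 20                   # from the Generation Specs
TEMPORAL_COMPRESSION = 4   # from the Architecture section

# 81 frames played back at 20 FPS last 4.05 seconds, i.e. "~4 seconds".
duration_s = NUM_FRAMES / FPS


def latent_frame_count(num_frames, stride=TEMPORAL_COMPRESSION):
    """Latent frames for a clip, ASSUMING a 1 + (T - 1) / stride layout.

    Assumption: the first frame is encoded alone and each later group of
    `stride` frames maps to one latent frame.
    """
    if num_frames < 1 or (num_frames - 1) % stride != 0:
        raise ValueError("frame count must have the form stride * k + 1")
    return 1 + (num_frames - 1) // stride


print(f"duration: {duration_s:.2f} s")
print(f"latent frames (assumed layout): {latent_frame_count(NUM_FRAMES)}")
```

Under that assumption, 81 frames is a natural clip length because 81 = 4 × 20 + 1.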

Action Input Format

SCOPE accepts 10-DoF action inputs per frame via a Parquet file:

Controller Buttons (6D binary):

| Index | Column | Action |
| --- | --- | --- |
| 0 | right_trigger | Fire (RT) |
| 1 | left_trigger | Aim Down Sights (LT) |
| 2 | south | Jump (A) |
| 3 | right_thumb | Melee (R3) |
| 4 | west | Reload (X) |
| 5 | north | Weapon Switch (Y) |

Dual Joystick (4D continuous):

| Column | Axes | Function |
| --- | --- | --- |
| j_left | [x, y] | Character movement (left stick) |
| j_right | [x, y] | Camera rotation (right stick) |
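
As a rough illustration of the layout above, the sketch below builds one row per frame using the column names from the two tables and flattens a row into a 10-D vector (6 buttons in index order, then the 4 joystick axes). Several details are assumptions rather than documented behavior: the joystick value range of [-1, 1], the sign convention for "forward", the flattening order of the sticks, and the helper names `make_action_rows` and `to_vector`.

```python
import math

# Column names taken from the tables above, in index order.
BUTTON_COLUMNS = ["right_trigger", "left_trigger", "south",
                  "right_thumb", "west", "north"]
STICK_COLUMNS = ["j_left", "j_right"]


def make_action_rows(num_frames=81):
    """Build per-frame action rows: walk forward, pan the camera, aim, fire."""
    rows = []
    for t in range(num_frames):
        row = {name: 0 for name in BUTTON_COLUMNS}
        # Assumption: stick axes lie in [-1, 1] and +y means "forward".
        row["j_left"] = [0.0, 1.0]
        row["j_right"] = [0.3 * math.sin(2 * math.pi * t / num_frames), 0.0]
        if 10 <= t < 40:
            row["left_trigger"] = 1   # aim down sights
        if 20 <= t < 40:
            row["right_trigger"] = 1  # fire while aiming (overlapping actions)
        rows.append(row)
    return rows


def to_vector(row):
    """Flatten one row into [6 buttons..., left x, left y, right x, right y]."""
    buttons = [float(row[name]) for name in BUTTON_COLUMNS]
    sticks = [float(axis) for name in STICK_COLUMNS for axis in row[name]]
    return buttons + sticks


rows = make_action_rows()
print(len(rows), len(to_vector(rows[0])))
```

If pandas and a Parquet engine such as pyarrow are installed, `pandas.DataFrame(rows).to_parquet("action.parquet")` would write these rows to disk. Whether `inference.py` expects exactly this schema should be checked against the GitHub repository.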

Quick Start

Requirements

  • Python >= 3.10
  • PyTorch >= 2.0 with CUDA support
  • GPU: NVIDIA with >= 24 GB VRAM (single GPU inference with CPU offload)

Installation

git clone https://github.com/z2tong/SCOPE.git
cd SCOPE
pip install -e .

Download Weights

# Download all weights (SCOPE DiT + Text Encoder + VAE + Tokenizer) in one command
huggingface-cli download zizhaotong/SCOPE --local-dir ./SCOPE

Inference

Single image + action sequence:

python inference.py \
    --model_dir ./SCOPE \
    --input_image input.png \
    --action_path action.parquet \
    --prompt "First-person shooter perspective in a modern city" \
    --seed 42

Batch processing (directory of images):

python inference.py \
    --model_dir ./SCOPE \
    --input_image_dir ./images \
    --action_path action.parquet \
    --prompt "First-person view in a battlefield" \
    --output_dir ./outputs

For full usage details and advanced options, see the GitHub repository.
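
To run the single-image command over several seeds, one option is to assemble the argument list in Python. The sketch below uses only the flags shown in the single-image example above. The helper name `build_inference_command` is hypothetical, and the script only builds and prints the commands rather than running them.

```python
import shlex


def build_inference_command(model_dir, input_image, action_path, prompt, seed=None):
    """Assemble the argv list for the single-image inference example."""
    cmd = [
        "python", "inference.py",
        "--model_dir", model_dir,
        "--input_image", input_image,
        "--action_path", action_path,
        "--prompt", prompt,
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


# One command per seed, same image, actions, and prompt.
commands = [
    build_inference_command(
        "./SCOPE",
        "input.png",
        "action.parquet",
        "First-person shooter perspective in a modern city",
        seed=seed,
    )
    for seed in (0, 1, 2)
]

for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted form, safe to copy into a terminal
```

Passing each list to `subprocess.run(cmd, check=True)` from inside the cloned repository would execute the runs in sequence.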

Repository Contents

This repo contains all weights needed for inference in a single download:

| File | Component | Size |
| --- | --- | --- |
| model-00001-of-00003.safetensors | SCOPE DiT shard 1 | ~5.0 GB |
| model-00002-of-00003.safetensors | SCOPE DiT shard 2 | ~5.0 GB |
| model-00003-of-00003.safetensors | SCOPE DiT shard 3 | ~4.6 GB |
| model.safetensors.index.json | Shard index mapping | - |
| models_t5_umt5-xxl-enc-bf16.pth | Text Encoder (UMT5-XXL) | ~20 GB |
| Wan2.2_VAE.pth | Video VAE | ~700 MB |
| google/umt5-xxl/ | Tokenizer | ~10 MB |
| config.json | Model architecture config | - |
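
After downloading, it can be useful to confirm that every entry listed above is present before starting inference. The following is a small sketch using only the standard library. The helper name `missing_entries` is hypothetical, and the check assumes the files sit directly under the download directory, as the table suggests.

```python
from pathlib import Path

# File and directory names copied from the table above.
EXPECTED_FILES = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "model.safetensors.index.json",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2_VAE.pth",
    "config.json",
]
EXPECTED_DIRS = ["google/umt5-xxl"]


def missing_entries(model_dir):
    """Return the expected files and directories not found under model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    missing = missing_entries("./SCOPE")
    if missing:
        print("Missing from ./SCOPE:", ", ".join(missing))
    else:
        print("All expected weights and configs are present.")
```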

Inference code is available at github.com/z2tong/SCOPE.

CrossFPS Dataset

SCOPE is trained on CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry:

| Property | Value |
| --- | --- |
| Games | 7 diverse FPS titles |
| Total Clips | 69,000+ |
| Action Dimensions | 10-DoF (6 buttons + 4D joystick) |
| Annotation | Frame-aligned action telemetry |
| Curation | Gameplay-bias removal for general visual-to-action mapping |

Citation

@article{scope2026,
  title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
  author={Zizhao Tong and Hongfeng Lai and Zeqing Wang and Zhaohu Xing and Kexu Cheng and Haoran Xu and Zhao Pu and Shangwen Zhu and Ruili Feng and Jian Zhao and Yan Zhang and Hao Tang and Yeying Jin and Ling Shao},
  year={2026}
}

Acknowledgements

We thank the Wan Team for open-sourcing Wan2.2 and the DiffSynth team for the inference framework.
