SF3-model β€” Action-conditional Wan2.2-TI2V-5B for Street Fighter III

This repository hosts checkpoints from Phase 1 of the SF3 (Street Fighter III) action-conditional video diffusion training. The model is a fine-tuned variant of Wan-AI/Wan2.2-TI2V-5B augmented with an ActionModule that conditions video generation on the 10-button arcade-stick action schema (UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C). The architecture and pipeline are reused verbatim from the SF2 line of work (WanModelAction / our_sf2_world.WanVideoPipeline) β€” SF3 differs only in dataset, target resolution, and prompt strategy.

Only checkpoints whose global step is a multiple of 2000 are published in this repo.


Repository layout

Three independent training runs are kept under separate prefixes:

| Run prefix | Cold start? | Trainable params | Prompt source |
|---|---|---|---|
| p1_joint_480x832_5s_verbose/ | yes (cold) | full DiT (incl. action embedders, ~5.0B) | per-clip verbose CSV prompt |
| p1_joint_480x832_5s_verbose/resume_step13000/ | resume from verbose/step-13000 | full DiT (incl. action embedders, ~5.0B) | per-clip verbose CSV prompt |
| p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/ | yes (cold) | DiT minus 30 cross-attn blocks (~4.1B); action embedders trained | fixed string "Street Fighter 3 arcade fighting game gameplay" |

File-naming inside each prefix is step-<N>.safetensors, where <N> is the global step counter of the run (resume runs restart from 0 β€” see "Resume semantics" below).

Each .safetensors file is a flat state-dict for the full DiT (WanModelAction); load it via pipe.dit.load_state_dict(...) (or use the project's train_action.py --load_action <path> flag to warm-start a new training run). VAE and T5 weights are not included β€” use the same Wan-AI/Wan2.2-TI2V-5B VAE / T5 weights as the base model.


Base model & architecture

  • Backbone: Wan-AI/Wan2.2-TI2V-5B (DiT, dim 3072, 30 layers, 24 heads, ffn 14336, in/out dim 48, patch size [1, 2, 2], seperated_timestep=True, fuse_vae_embedding_in_latents=True, no CLIP, no VAE conditioning).
  • Action conditioning (ActionModule, inserted in every DiT block alongside cross_attn):
    • mouse_action: unused in the SF3 schema (set to zeros); slot kept for API compatibility.
    • keyboard_action: [B, T, 10] one-hot for the SF3 arcade buttons (UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C).
    • Output projections are zero-initialized, so the action branch starts as an identity residual.
  • Pipeline: our_sf2_world.WanVideoPipeline (T2Vβ†’TI2V with first-frame conditioning).
  • Custom code lives at Training_Wan/diffsynth/models/our_small_dit.py (and the SF2/SF3 WanModelAction variant). Loading these checkpoints requires the matching custom code from the training repo β€” they are not drop-in compatible with stock Wan2.2 inference scripts.
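As a concrete sketch of the action inputs described above (shapes follow this README; the helper name and the 2-wide mouse placeholder are hypothetical, not repo code):

```python
import torch

# SF3 button order per the schema above; index = position in this list.
BUTTONS = ["UP", "DOWN", "LEFT", "RIGHT", "Y", "X", "Z", "A", "B", "C"]

def make_action_inputs(pressed_per_frame, batch_size=1):
    """Build a [B, T, 10] one-hot keyboard_action tensor plus a zero
    mouse_action placeholder.

    pressed_per_frame: length-T list of button-name lists, one per frame.
    """
    T = len(pressed_per_frame)
    keyboard = torch.zeros(batch_size, T, len(BUTTONS))
    for t, pressed in enumerate(pressed_per_frame):
        for name in pressed:
            keyboard[:, t, BUTTONS.index(name)] = 1.0
    # mouse_action is unused in the SF3 schema; zeros keep the API shape
    # (the width of 2 is an assumption for illustration)
    mouse = torch.zeros(batch_size, T, 2)
    return keyboard, mouse
```

For example, `make_action_inputs([["DOWN", "RIGHT"], ["A"]])` yields a `(1, 2, 10)` keyboard tensor with exactly the pressed-button slots set to 1.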

Training data

  • Dataset root: datasets/SF3/ β€” 39,562 5-second SF III gameplay clips, each at 384 Γ— 224 (W Γ— H) / 20 fps / 200 frames, paired with a per-frame action parquet (10 binary button columns, 1 row per video frame).

  • Action upsampling: SF3 uses _hold_last_upsample(window=10) to quantize button states to 2 Hz (hold every 0.5 s), identical to SF2 cold-start.
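The exact `_hold_last_upsample` implementation lives in the training repo; one plausible reading, sketched below, snaps each 10-frame window (0.5 s at 20 fps) to a single held state taken from that window's last frame:

```python
def hold_last_upsample(actions, window=10):
    """Quantize per-frame actions to one held state per window.

    actions: length-T sequence of per-frame button vectors; every frame in a
    window is replaced by the window's last frame, so button states change at
    most once per window (2 Hz at 20 fps with window=10).
    """
    out = []
    for start in range(0, len(actions), window):
        block = actions[start:start + window]
        out.extend([block[-1]] * len(block))
    return out
```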

  • Train CSVs the published checkpoints were trained on:

    • metadata_train_35k.csv β€” 35,000 rows. Built from the full 39,561-row train split by keeping all non-Reactive Defense rows in full and randomly subsampling Reactive Defense from 18,171 β†’ 13,610 (pandas.sample(random_state=0)), then sort_index(). This brings RD from 45.93% β†’ 38.89% so it no longer drowns out the action signal. Used by p1_joint_480x832_5s_verbose/ (both the cold-start and the resume_step13000 runs). Built via examples/sf3/scripts/build_train_35k.py.
    • metadata_train_10k_balanced.csv β€” 10,002 rows. 8 strategy buckets capped at 1,945 each (rare strategies β€” Defensive Zoning 219, Zoning and Anti-Air 57, Passive Defense 1 β€” taken in full); typo Defenisve normalized to Defensive; bracketed variants ([reactive]/[passive]/[aggressive]) consolidated into base strategies. Built 2026-04-26 from metadata_train_35k.csv with random_state=42. Used by p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/.
  • Strategy distribution (metadata_train_35k.csv):

    | Strategy | Rows | Share |
    |---|---|---|
    | Reactive Defense | 13,610 | 38.89% |
    | Evasive Defense | 10,302 | 29.43% |
    | Aggressive Approach | 5,602 | 16.01% |
    | Offensive Pressure | 2,692 | 7.69% |
    | Hit and Run | 2,517 | 7.19% |
    | Defensive Zoning | 218 | 0.62% |
    | Zoning and Anti-Air | 57 | 0.16% |
    | Passive Defense / Defenisve Zoning | 2 | 0.01% |
  • Slice length per clip: first 101 frames (~5 s) at training time (--num_frames 101), aligned with SF2 Round 5.

  • Train cache (used by the fixedprompt run only): datasets/SF3/cache_480x832_5s_fixed/ β€” VAE latents and T5 embeddings precomputed offline. video/ and first_frame/ are symlinked from cache_480x832_5s/ (39,562 .pt each, prompt-independent and reusable). t5/ contains 2 entries: the SF3 fixed prompt and the empty string. Loading from cache drops VAE+T5 from GPU after init (~11 GB saved) and gives ~30–50 % per-step speedup vs the non-cached run.
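The 35k-CSV construction described above can be sketched as follows (the column name `strategy` and the helper are assumptions for illustration; the real script is examples/sf3/scripts/build_train_35k.py):

```python
import pandas as pd

def build_train_35k(df, target_rd=13610, seed=0):
    """Keep all non-Reactive-Defense rows, subsample RD, restore CSV order."""
    rd = df[df["strategy"] == "Reactive Defense"]
    rest = df[df["strategy"] != "Reactive Defense"]
    # matches the documented pandas.sample(random_state=0) + sort_index()
    rd_sub = rd.sample(n=target_rd, random_state=seed)
    return pd.concat([rest, rd_sub]).sort_index()
```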


Training method

All runs use the same launcher (accelerate launch --num_processes=8 --multi_gpu) on 8Γ— H200 (143 GB), with batch size 1/GPU and accumulation 1 (effective bs = 8). Optimizer is the project default; loss is FlowMatch SFT loss; gradient checkpointing on, gc_offload off (~99 GB/card; gc_offload on would drop to ~74 GB but cost 2–3Γ— slower steps).

Common hyper-parameters

| Item | Value |
|---|---|
| Resolution (H × W) | 480 × 832 (aspect 1.733, +1% vs source 1.714) |
| Frames | 101 (5 s @ ~20 fps) |
| Token budget | 51 (latent T) × 15 (H/32) × 26 (W/32) = 19,890 patched tokens |
| Learning rate | 5e-5 |
| Save interval | every 1,000 steps |
| enable_action | true |
| action_hold_window | 10 |
| action_dropout_prob | 0.0 |
| prompt_dropout_prob | 0.1 (CFG-friendly) |
| dataset_repeat | 1 |
| dataset_num_workers | 4 |
| Mixed precision | bf16 (Wan2.2 default) |
| Hardware | 8× NVIDIA H200 143 GB |
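The token budget follows directly from the resolution and the 32× spatial downsample (16× VAE × 2× patchify per axis), with 101 input frames mapping to 51 latent frames as stated above:

```python
# Reproducing the token-budget arithmetic from the table above.
latent_t, H, W = 51, 480, 832
tokens = latent_t * (H // 32) * (W // 32)   # 51 * 15 * 26
print(tokens)  # 19890
```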

Run-specific differences

1. p1_joint_480x832_5s_verbose/ (cold start)

  • Script: examples/sf2/model_training/action_5s/train_action.py
  • Launch: examples/sf3/model_training/action_5s/attempt_joint_5s_480x832.sh
  • Init: cold start from base Wan2.2-TI2V-5B (no --load_action, no --load_mg3, no --fuse_loRA). action_embedders randomly initialized (residual zero-init inside module).
  • Trainable: --trainable_models dit (full-parameter; includes action_embedders and cross-attn).
  • Prompt source: --use_csv_prompt --prompt_column prompt (per-clip verbose prompt from CSV).

2. p1_joint_480x832_5s_verbose/resume_step13000/ (warm restart)

  • Script: same train_action.py, but launched with --load_action .../p1_joint_480x832_5s_verbose/step-13000.safetensors.
  • Launch: examples/sf3/model_training/action_5s/attempt_joint_5s_480x832_resume13k.sh
  • Resume semantics: only DiT weights are restored from step-13000. Optimizer momentum, LR schedule, RNG, and dataloader position are not restored β€” the global step counter restarts from 0, so files in this folder named step-N correspond to global step 13000 + N of the conceptual run. Expect a small loss bump in the first ~500 steps (optimizer warm-up).
  • All other flags identical to the cold start above.
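Since the resume run's step counter restarts at zero, mapping a checkpoint filename back to its conceptual global step is a fixed offset (trivial helper for illustration only):

```python
def global_step(local_step, resumed_from=13000):
    """Map step-<N> in the resume folder to the conceptual global step."""
    return resumed_from + local_step

print(global_step(36000))  # 49000
```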

3. p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/ (cold + freeze cross-attn)

  • Script: examples/sf2/model_training/action_5s/train_action_cached.py (cached / precomputed-latents trainer).
  • Launch: examples/sf3/model_training/action_5s/attempt_joint_5s_480x832_cached_fixedprompt_coldstart_freeze_xattn.sh
  • Init: cold start from base Wan2.2-TI2V-5B; action_embedders randomly initialized.
  • Prompt source: fixed prompt "Street Fighter 3 arcade fighting game gameplay" applied to every CSV row (no --use_csv_prompt β€” verbose per-clip prompts in the CSV are ignored).
  • Trainable: --trainable_models dit --freeze_filter cross_attn. All 30 DiTBlock.cross_attn modules (q/k/v/o + norm3) have requires_grad=False. Updates flow through self_attn, ffn, norms, patch_embed, time/text/img embedders, output head, and action_embedders β€” roughly ~4.1B trainable (vs ~5.0B in the verbose runs).
  • Training data: metadata_train_10k_balanced.csv (10,002 rows, see above), via the cache cache_480x832_5s_fixed/.
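A name-based freeze filter like `--freeze_filter cross_attn` can be sketched as below (the actual trainer logic may differ; this simply flips `requires_grad` on every parameter whose qualified name contains the needle):

```python
import torch.nn as nn

def freeze_by_name(model: nn.Module, needle: str = "cross_attn"):
    """Freeze all parameters whose qualified name contains `needle`;
    returns the number of scalar parameters frozen."""
    frozen = 0
    for name, param in model.named_parameters():
        if needle in name:
            param.requires_grad = False
            frozen += param.numel()
    return frozen
```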

Available checkpoints (steps that are multiples of 2000)

  • p1_joint_480x832_5s_verbose/: 2k, 4k, 6k, 8k, 10k, 12k
  • p1_joint_480x832_5s_verbose/resume_step13000/: 2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k, 28k, 30k, 32k, 34k, 36k (corresponding to global steps 15k / 17k / ... / 49k of the conceptual run)
  • p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/: 2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k

Intermediate (1k / 3k / 5k / ...) checkpoints exist in our local training output but are not uploaded here.


Loading a checkpoint

These weights require the custom WanModelAction code from the training repo (Training_Wan/diffsynth/models/our_small_dit.py and friends). Rough sketch:

```python
from diffsynth.pipelines.our_sf2_world import WanVideoPipeline
import torch, safetensors.torch

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16, device="cuda",
    # ... point at your local Wan2.2-TI2V-5B + UMT5 weights ...
    enable_action=True,
)
state = safetensors.torch.load_file(
    "p1_joint_480x832_5s_verbose/resume_step13000/step-36000.safetensors"
)
# strict=False tolerates any key mismatch between the flat state-dict
# and the instantiated WanModelAction
pipe.dit.load_state_dict(state, strict=False)
```

For training continuation, pass the file path via --load_action /path/to/step-N.safetensors to the SF2/SF3 train_action*.py scripts.


Notes & caveats

  • These are research checkpoints from an in-progress training campaign β€” not a polished release. The verbose cold start and the resume_step13000 continuation form one conceptual training trajectory; the fixedprompt_coldstart_freeze_xattn run is an independent ablation isolating the contribution of cross-attn updates and per-clip prompts.
  • VAE / T5 weights are not bundled. Use the originals from Wan-AI/Wan2.2-TI2V-5B and Wan-AI/Wan2.1-T2V-1.3B (UMT5 tokenizer).
  • Action input expects the SF2/SF3 10-button schema in the order UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C; mismatched ordering will silently produce garbage.
  • For the freeze_xattn run, prompts at inference time should match the training-time fixed prompt ("Street Fighter 3 arcade fighting game gameplay") β€” the cross-attn layers were frozen and never adapted to per-clip text.

Citation / acknowledgement

Built on top of Wan-AI/Wan2.2-TI2V-5B and a research fork of modelscope/DiffSynth-Studio.
