# SF3-model: Action-conditional Wan2.2-TI2V-5B for Street Fighter III
This repository hosts checkpoints from Phase 1 of the SF3 (Street Fighter III) action-conditional
video diffusion training. The model is a fine-tuned variant of
Wan-AI/Wan2.2-TI2V-5B augmented with an
`ActionModule` that conditions video generation on the 10-button arcade-stick action schema
(UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C). The architecture and pipeline are reused
verbatim from the SF2 line of work (`WanModelAction` / `our_sf2_world.WanVideoPipeline`); SF3
differs only in dataset, target resolution, and prompt strategy.

Only checkpoints whose global step is a multiple of 2000 are published in this repo.
## Repository layout
Three independent training runs are kept under separate prefixes:
| Run prefix | Cold start? | Trainable params | Prompt source |
|---|---|---|---|
| `p1_joint_480x832_5s_verbose/` | yes (cold) | full DiT (incl. action embedders, ~5.0B) | per-clip verbose CSV prompt |
| `p1_joint_480x832_5s_verbose/resume_step13000/` | resume from verbose/step-13000 | full DiT (incl. action embedders, ~5.0B) | per-clip verbose CSV prompt |
| `p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/` | yes (cold) | DiT minus 30 cross-attn blocks (~4.1B); action embedders trained | fixed string "Street Fighter 3 arcade fighting game gameplay" |
File-naming inside each prefix is `step-<N>.safetensors`, where `<N>` is the global step counter
of the run (resume runs restart from 0; see "Resume semantics" below).

Each `.safetensors` file is a flat state-dict for the full DiT (`WanModelAction`); load it via
`pipe.dit.load_state_dict(...)` (or use the project's `train_action.py --load_action <path>` flag
to warm-start a new training run). VAE and T5 weights are not included; use the same
Wan-AI/Wan2.2-TI2V-5B VAE / T5 weights as the base model.
## Base model & architecture

- Backbone: `Wan-AI/Wan2.2-TI2V-5B` (DiT, dim 3072, 30 layers, 24 heads, FFN 14336, in/out dim 48, patch size `[1, 2, 2]`, `seperated_timestep=True`, `fuse_vae_embedding_in_latents=True`, no CLIP, no VAE conditioning).
- Action conditioning (`ActionModule`, inserted in every DiT block alongside `cross_attn`):
  - `mouse_action`: unused in the SF3 schema (set to zeros); slot kept for API compatibility.
  - `keyboard_action`: `[B, T, 10]` one-hot for the SF3 arcade buttons (UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C).
  - Output projections are zero-initialized, so the action branch starts as an identity residual.
- Pipeline: `our_sf2_world.WanVideoPipeline` (T2V → TI2V with first-frame conditioning).
- Custom code lives at `Training_Wan/diffsynth/models/our_small_dit.py` (and the SF2/SF3 `WanModelAction` variant). Loading these checkpoints requires the matching custom code from the training repo; they are not drop-in compatible with stock Wan2.2 inference scripts.
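The zero-initialized residual action branch described above can be sketched as follows. This is an illustrative reconstruction, not the published `ActionModule`: the layer names (`embed`, `out_proj`), hidden sizes, and the way the action embedding is merged into the token stream are assumptions; only the zero-init-means-identity behavior is taken from the source.

```python
import torch
import torch.nn as nn

class ActionResidualSketch(nn.Module):
    """Sketch of a per-block action branch whose output projection is
    zero-initialized, so at step 0 the residual contributes nothing and
    the block behaves exactly like the base DiT block."""

    def __init__(self, action_dim: int = 10, hidden_dim: int = 3072):
        super().__init__()
        self.embed = nn.Linear(action_dim, hidden_dim)   # hypothetical embedder
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out_proj.weight)             # zero-init output proj
        nn.init.zeros_(self.out_proj.bias)               # => identity residual

    def forward(self, x: torch.Tensor, keyboard_action: torch.Tensor) -> torch.Tensor:
        # keyboard_action: [B, T, 10] one-hot buttons; pooled over time here
        # purely for illustration (the real merge strategy is repo-specific).
        a = self.out_proj(torch.relu(self.embed(keyboard_action)))
        return x + a.mean(dim=1, keepdim=True)

x = torch.randn(1, 4, 3072)
act = torch.zeros(1, 101, 10)
mod = ActionResidualSketch()
assert torch.allclose(mod(x, act), x)  # zero-init => exact no-op at init
```

Because `out_proj` starts at zero, gradients still flow into the action embedders while the pretrained video prior is untouched at the first step, which is what makes a cold start stable.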
## Training data

- Dataset root: `datasets/SF3/` contains 39,562 5-second SF III gameplay clips, each at 384 × 224 (W × H) / 20 fps / 200 frames, paired with a per-frame action `parquet` (10 binary button columns, 1 row per video frame).
- Action upsampling: SF3 uses `_hold_last_upsample(window=10)` to quantize button states to 2 Hz (hold every 0.5 s), identical to SF2 cold-start.
- Train CSVs the published checkpoints were trained on:
  - `metadata_train_35k.csv`: 35,000 rows. Built from the full 39,561-row train split by keeping all non-`Reactive Defense` rows in full and randomly subsampling `Reactive Defense` from 18,171 → 13,610 (pandas `.sample(random_state=0)`), then `sort_index()`. This brings RD from 45.93% → 38.89% so it no longer drowns out the action signal. Used by `p1_joint_480x832_5s_verbose/` (both the cold-start and the resume_step13000 runs). Built via `examples/sf3/scripts/build_train_35k.py`.
  - `metadata_train_10k_balanced.csv`: 10,002 rows. 8 strategy buckets capped at 1,945 each (rare strategies taken in full: `Defensive Zoning` 219, `Zoning and Anti-Air` 57, `Passive Defense` 1); the typo `Defenisve` normalized to `Defensive`; bracketed variants (`[reactive]` / `[passive]` / `[aggressive]`) consolidated into base strategies. Built 2026-04-26 from `metadata_train_35k.csv` with `random_state=42`. Used by `p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/`.

Strategy distribution (`metadata_train_35k.csv`):

| Strategy | Rows | Share |
|---|---|---|
| Reactive Defense | 13,610 | 38.89% |
| Evasive Defense | 10,302 | 29.43% |
| Aggressive Approach | 5,602 | 16.01% |
| Offensive Pressure | 2,692 | 7.69% |
| Hit and Run | 2,517 | 7.19% |
| Defensive Zoning | 218 | 0.62% |
| Zoning and Anti-Air | 57 | 0.16% |
| Passive Defense / Defenisve Zoning | 2 | 0.01% |

- Slice length per clip: first 101 frames (~5 s) at training time (`--num_frames 101`), aligned with SF2 Round 5.
- Train cache (used by the `fixedprompt` run only): `datasets/SF3/cache_480x832_5s_fixed/` holds VAE latents and T5 embeddings precomputed offline. `video/` and `first_frame/` are symlinked from `cache_480x832_5s/` (39,562 `.pt` each, prompt-independent and reusable). `t5/` contains 2 entries: the SF3 fixed prompt and the empty string. Loading from cache drops VAE+T5 from GPU after init (~11 GB saved) and gives ~30-50% per-step speedup vs the non-cached run.
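One plausible reading of `_hold_last_upsample(window=10)` (the actual implementation lives in the training repo; this sketch assumes the representative state for each 10-frame window is its last frame, as the name suggests, and that the result stays at per-frame length):

```python
import numpy as np

def hold_last_upsample(actions: np.ndarray, window: int = 10) -> np.ndarray:
    """Sketch: quantize per-frame actions [T, 10] to one state per window,
    held across every frame of that window. At 20 fps with window=10 this
    yields the 2 Hz effective action resolution described above."""
    T = actions.shape[0]
    out = actions.copy()
    for start in range(0, T, window):
        end = min(start + window, T)
        out[start:end] = actions[end - 1]  # hold the window's last state
    return out

acts = np.zeros((20, 10), dtype=np.int64)
acts[9, 0] = 1  # UP pressed only on the last frame of the first window
held = hold_last_upsample(acts, window=10)
assert held[:10, 0].sum() == 10  # the whole first window now holds UP
```

The key property, regardless of which frame is chosen as representative, is that button states are piecewise constant over 0.5 s blocks, matching the `action_hold_window = 10` hyper-parameter below.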
## Training method

All runs use the same launcher (`accelerate launch --num_processes=8 --multi_gpu`) on
8× H200 (143 GB), with batch size 1/GPU and gradient accumulation 1 (effective batch size = 8).
Optimizer is the project default; loss is FlowMatch SFT loss; gradient checkpointing is on,
`gc_offload` is off (~99 GB/card; `gc_offload` on would drop to ~74 GB but make steps
2-3× slower).
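The FlowMatch SFT loss mentioned above is, in its standard rectified-flow form, an MSE between the model's predicted velocity and the constant velocity of the linear data-to-noise path. A minimal sketch, assuming the common formulation (the repo's exact timestep sampling and loss weighting may differ):

```python
import torch
import torch.nn.functional as F

def make_noisy_input(x0: torch.Tensor, t: torch.Tensor):
    """x_t on the linear path between data (t=0) and noise (t=1)."""
    noise = torch.randn_like(x0)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast t over latent dims
    return (1.0 - t_) * x0 + t_ * noise, noise

def flow_match_loss(pred_velocity: torch.Tensor,
                    x0: torch.Tensor,
                    noise: torch.Tensor) -> torch.Tensor:
    """Rectified-flow target: velocity = noise - x0 (constant along the path)."""
    return F.mse_loss(pred_velocity, noise - x0)

x0 = torch.randn(2, 4, 8)   # toy "clean latents"
t = torch.rand(2)
x_t, noise = make_noisy_input(x0, t)
# a perfect model would output exactly noise - x0, giving zero loss:
assert flow_match_loss(noise - x0, x0, noise).item() == 0.0
```

In the actual trainer the model receives `x_t`, the timestep, the text embedding, and the action tensor, and is supervised toward this velocity target.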
### Common hyper-parameters

| Item | Value |
|---|---|
| Resolution (H × W) | 480 × 832 (aspect 1.733, +1% vs source 1.714) |
| Frames | 101 (5 s @ ~20 fps) |
| Token budget | 51 (latent T) × 15 (H/32) × 26 (W/32) = 19,890 patched tokens |
| Optimizer learning rate | 5e-5 |
| Save interval | every 1,000 steps |
| `enable_action` | true |
| `action_hold_window` | 10 |
| `action_dropout_prob` | 0.0 |
| `prompt_dropout_prob` | 0.1 (CFG-friendly) |
| `dataset_repeat` | 1 |
| `dataset_num_workers` | 4 |
| Mixed precision | bf16 (Wan2.2 default) |
| Hardware | 8× NVIDIA H200 143 GB |
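The token budget in the table follows directly from the numbers it lists: the ÷32 spatial factor combines the VAE's spatial compression with the DiT's 2×2 patchification (patch size `[1, 2, 2]` above), and the latent T of 51 is taken as given. A quick arithmetic check:

```python
# Token-budget check for 480x832, 101 frames (latent T = 51 per the table).
latent_t = 51          # temporal latent length reported for 101 frames
h_tokens = 480 // 32   # VAE spatial compression * 2x2 patchify -> 15
w_tokens = 832 // 32   # -> 26
tokens = latent_t * h_tokens * w_tokens
assert (h_tokens, w_tokens, tokens) == (15, 26, 19_890)
```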
## Run-specific differences

### 1. `p1_joint_480x832_5s_verbose/` (cold start)

- Script: `examples/sf2/model_training/action_5s/train_action.py`
- Launch: `examples/sf3/model_training/action_5s/attempt_joint_5s_480x832.sh`
- Init: cold start from base Wan2.2-TI2V-5B (no `--load_action`, no `--load_mg3`, no `--fuse_loRA`). `action_embedders` randomly initialized (residual zero-init inside the module).
- Trainable: `--trainable_models dit` (full-parameter; includes `action_embedders` and cross-attn).
- Prompt source: `--use_csv_prompt --prompt_column prompt` (per-clip verbose prompt from CSV).

### 2. `p1_joint_480x832_5s_verbose/resume_step13000/` (warm restart)

- Script: same `train_action.py`, but launched with `--load_action .../p1_joint_480x832_5s_verbose/step-13000.safetensors`.
- Launch: `examples/sf3/model_training/action_5s/attempt_joint_5s_480x832_resume13k.sh`
- Resume semantics: only DiT weights are restored from `step-13000`. Optimizer momentum, LR schedule, RNG, and dataloader position are not restored; the global step counter restarts from 0, so files in this folder named `step-N` correspond to global step `13000 + N` of the conceptual run. Expect a small loss bump in the first ~500 steps (optimizer warm-up).
- All other flags identical to the cold start above.

### 3. `p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/` (cold + freeze cross-attn)

- Script: `examples/sf2/model_training/action_5s/train_action_cached.py` (cached / precomputed-latents trainer).
- Launch: `examples/sf3/model_training/action_5s/attempt_joint_5s_480x832_cached_fixedprompt_coldstart_freeze_xattn.sh`
- Init: cold start from base Wan2.2-TI2V-5B; `action_embedders` randomly initialized.
- Prompt source: fixed prompt `"Street Fighter 3 arcade fighting game gameplay"` applied to every CSV row (no `--use_csv_prompt`; verbose per-clip prompts in the CSV are ignored).
- Trainable: `--trainable_models dit --freeze_filter cross_attn`. All 30 `DiTBlock.cross_attn` modules (q/k/v/o + norm3) have `requires_grad=False`. Updates flow through `self_attn`, `ffn`, norms, `patch_embed`, time/text/img embedders, the output head, and `action_embedders`; roughly ~4.1B trainable (vs ~5.0B in the verbose runs).
- Training data: `metadata_train_10k_balanced.csv` (10,002 rows, see above), via the cache `cache_480x832_5s_fixed/`.
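The `--freeze_filter cross_attn` behavior amounts to a name-based filter over parameters. The flag's exact matching logic lives in the training repo; the sketch below assumes simple substring matching on qualified parameter names, which reproduces the described effect (all `cross_attn` q/k/v/o and norm weights frozen, everything else trainable):

```python
import torch.nn as nn

def freeze_by_name(model: nn.Module, substring: str = "cross_attn") -> int:
    """Freeze every parameter whose qualified name contains `substring`.
    Returns the number of parameter tensors frozen."""
    n = 0
    for name, p in model.named_parameters():
        if substring in name:
            p.requires_grad = False
            n += 1
    return n

# Toy stand-in for one DiT block (not the real architecture).
class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = nn.Linear(8, 8)
        self.cross_attn = nn.Linear(8, 8)

m = ToyBlock()
frozen = freeze_by_name(m, "cross_attn")
assert frozen == 2  # cross_attn.weight + cross_attn.bias
assert not m.cross_attn.weight.requires_grad
assert m.self_attn.weight.requires_grad
```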
## Available checkpoints (steps that are multiples of 2000)

- `p1_joint_480x832_5s_verbose/`: 2k, 4k, 6k, 8k, 10k, 12k
- `p1_joint_480x832_5s_verbose/resume_step13000/`: 2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k, 28k, 30k, 32k, 34k, 36k (these map to global steps 15k / 17k / ... / 49k)
- `p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/`: 2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k

Intermediate (1k / 3k / 5k / ...) checkpoints exist in our local training output but are not uploaded here.
## Loading a checkpoint

These weights require the custom `WanModelAction` code from the training repo
(`Training_Wan/diffsynth/models/our_small_dit.py` and friends). Rough sketch:

```python
import torch
import safetensors.torch
from diffsynth.pipelines.our_sf2_world import WanVideoPipeline

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16, device="cuda",
    # ... point at your local Wan2.2-TI2V-5B + UMT5 weights ...
    enable_action=True,
)
state = safetensors.torch.load_file(
    "p1_joint_480x832_5s_verbose/resume_step13000/step-36000.safetensors"
)
pipe.dit.load_state_dict(state, strict=False)
```
For training continuation, pass the file path via
`--load_action /path/to/step-N.safetensors` to the SF2/SF3 `train_action*.py` scripts.
## Notes & caveats

- These are research checkpoints from an in-progress training campaign, not a polished release. The `verbose` cold start and the `resume_step13000` continuation form one conceptual training trajectory; the `fixedprompt_coldstart_freeze_xattn` run is an independent ablation isolating the contribution of cross-attn updates and per-clip prompts.
- VAE / T5 weights are not bundled. Use the originals from `Wan-AI/Wan2.2-TI2V-5B` and `Wan-AI/Wan2.1-T2V-1.3B` (UMT5 tokenizer).
- Action input expects the SF2/SF3 10-button schema in the order UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C; mismatched ordering will silently produce garbage.
- For the `freeze_xattn` run, prompts at inference time should match the training-time fixed prompt (`"Street Fighter 3 arcade fighting game gameplay"`); the cross-attn layers were frozen and never adapted to per-clip text.
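Given the ordering caveat above, building the action tensor from button names rather than hand-written indices is the safest route. `BUTTONS` below encodes the documented order; the helper itself is illustrative (not part of the repo), and note that simultaneous presses (e.g. DOWN + RIGHT) make the encoding multi-hot in practice even though each column is a one-hot button indicator:

```python
import torch

# Documented SF3 button order; the index position is what the model sees.
BUTTONS = ["UP", "DOWN", "LEFT", "RIGHT", "Y", "X", "Z", "A", "B", "C"]
IDX = {b: i for i, b in enumerate(BUTTONS)}

def encode_actions(frames: list) -> torch.Tensor:
    """frames: per-frame lists of pressed button names -> [T, 10] tensor."""
    out = torch.zeros(len(frames), len(BUTTONS))
    for t, pressed in enumerate(frames):
        for b in pressed:
            out[t, IDX[b]] = 1.0
    return out

acts = encode_actions([["DOWN", "RIGHT"], ["A"], []])
assert acts.shape == (3, 10)
assert acts[0, IDX["DOWN"]] == 1 and acts[1, IDX["A"]] == 1
```

Add a batch dimension (`acts.unsqueeze(0)`) to match the `[B, T, 10]` shape the `ActionModule` expects.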
## Citation / acknowledgement
Built on top of Wan-AI/Wan2.2-TI2V-5B and a research fork of modelscope/DiffSynth-Studio.
Model tree for `INV-WZQ/SF3-model`: base model `Wan-AI/Wan2.2-TI2V-5B`.