SF3-model β€” Action-conditional Wan2.2-TI2V-5B for Street Fighter III

This repository hosts checkpoints from Phase 1 of the SF3 (Street Fighter III) action-conditional video diffusion training. The model is a fine-tuned variant of Wan-AI/Wan2.2-TI2V-5B augmented with an ActionModule that conditions video generation on the 10-button arcade-stick action schema (UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C). The architecture and pipeline are reused verbatim from the SF2 line of work (WanModelAction / our_sf2_world.WanVideoPipeline) β€” SF3 differs only in dataset, target resolution, and prompt strategy.

Only checkpoints whose global step is a multiple of 2000 are published in this repo.


Repository layout

Three independent training runs are kept under separate prefixes:

| Run prefix | Cold start? | Trainable params | Prompt source |
|---|---|---|---|
| p1_joint_480x832_5s_verbose/ | yes (cold) | full DiT (incl. action embedders, ~5.0B) | per-clip verbose CSV prompt |
| p1_joint_480x832_5s_verbose/resume_step13000/ | resume from verbose/step-13000 | full DiT (incl. action embedders, ~5.0B) | per-clip verbose CSV prompt |
| p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/ | yes (cold) | DiT minus 30 cross-attn blocks (~4.1B); action embedders trained | fixed string "Street Fighter 3 arcade fighting game gameplay" |

File-naming inside each prefix is step-<N>.safetensors, where <N> is the global step counter of the run (resume runs restart from 0 β€” see "Resume semantics" below).

Each .safetensors file is a flat state-dict for the full DiT (WanModelAction); load it via pipe.dit.load_state_dict(...) (or use the project's train_action.py --load_action <path> flag to warm-start a new training run). VAE and T5 weights are not included β€” use the same Wan-AI/Wan2.2-TI2V-5B VAE / T5 weights as the base model.


Base model & architecture

  • Backbone: Wan-AI/Wan2.2-TI2V-5B (DiT, dim 3072, 30 layers, 24 heads, ffn 14336, in/out dim 48, patch size [1, 2, 2], seperated_timestep=True, fuse_vae_embedding_in_latents=True, no CLIP, no VAE conditioning).
  • Action conditioning (ActionModule, inserted in every DiT block alongside cross_attn):
    • mouse_action: unused in the SF3 schema (set to zeros); slot kept for API compatibility.
    • keyboard_action: [B, T, 10] one-hot for the SF3 arcade buttons (UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C).
    • Output projections are zero-initialized, so the action branch starts as an identity residual.
  • Pipeline: our_sf2_world.WanVideoPipeline (T2Vβ†’TI2V with first-frame conditioning).
  • Custom code lives at Training_Wan/diffsynth/models/our_small_dit.py (and the SF2/SF3 WanModelAction variant). Loading these checkpoints requires the matching custom code from the training repo β€” they are not drop-in compatible with stock Wan2.2 inference scripts.
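As a concrete sketch of the action inputs described above (shapes follow this README; the helper name and the 2-wide mouse placeholder are hypothetical, not repo code):

```python
import torch

# SF3 button order per the schema above; index = position in this list.
BUTTONS = ["UP", "DOWN", "LEFT", "RIGHT", "Y", "X", "Z", "A", "B", "C"]

def make_action_inputs(pressed_per_frame, batch_size=1):
    """Build a [B, T, 10] one-hot keyboard_action tensor plus a zero
    mouse_action placeholder.

    pressed_per_frame: length-T list of button-name lists, one per frame.
    """
    T = len(pressed_per_frame)
    keyboard = torch.zeros(batch_size, T, len(BUTTONS))
    for t, pressed in enumerate(pressed_per_frame):
        for name in pressed:
            keyboard[:, t, BUTTONS.index(name)] = 1.0
    # mouse_action is unused in the SF3 schema; zeros keep the API shape
    # (the width of 2 is an assumption for illustration)
    mouse = torch.zeros(batch_size, T, 2)
    return keyboard, mouse
```

For example, `make_action_inputs([["DOWN", "RIGHT"], ["A"]])` yields a `(1, 2, 10)` keyboard tensor with exactly the pressed-button slots set to 1.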

Training data

  • Dataset root: datasets/SF3/ β€” 39,562 5-second SF III gameplay clips, each at 384 Γ— 224 (W Γ— H) / 20 fps / 200 frames, paired with a per-frame action parquet (10 binary button columns, 1 row per video frame).

  • Action upsampling: SF3 uses _hold_last_upsample(window=10) to quantize button states to 2 Hz (hold every 0.5 s), identical to SF2 cold-start.
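The exact `_hold_last_upsample` implementation lives in the training repo; one plausible reading, sketched below, snaps each 10-frame window (0.5 s at 20 fps) to a single held state taken from that window's last frame:

```python
def hold_last_upsample(actions, window=10):
    """Quantize per-frame actions to one held state per window.

    actions: length-T sequence of per-frame button vectors; every frame in a
    window is replaced by the window's last frame, so button states change at
    most once per window (2 Hz at 20 fps with window=10).
    """
    out = []
    for start in range(0, len(actions), window):
        block = actions[start:start + window]
        out.extend([block[-1]] * len(block))
    return out
```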

  • Train CSVs the published checkpoints were trained on:

    • metadata_train_35k.csv β€” 35,000 rows. Built from the full 39,561-row train split by keeping all non-Reactive Defense rows in full and randomly subsampling Reactive Defense from 18,171 β†’ 13,610 (pandas.sample(random_state=0)), then sort_index(). This brings RD from 45.93% β†’ 38.89% so it no longer drowns out the action signal. Used by p1_joint_480x832_5s_verbose/ (both the cold-start and the resume_step13000 runs). Built via examples/sf3/scripts/build_train_35k.py.
    • metadata_train_10k_balanced.csv β€” 10,002 rows. 8 strategy buckets capped at 1,945 each (rare strategies β€” Defensive Zoning 219, Zoning and Anti-Air 57, Passive Defense 1 β€” taken in full); typo Defenisve normalized to Defensive; bracketed variants ([reactive]/[passive]/[aggressive]) consolidated into base strategies. Built 2026-04-26 from metadata_train_35k.csv with random_state=42. Used by p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/.
  • Strategy distribution (metadata_train_35k.csv):

    | Strategy | Rows | Share |
    |---|---|---|
    | Reactive Defense | 13,610 | 38.89% |
    | Evasive Defense | 10,302 | 29.43% |
    | Aggressive Approach | 5,602 | 16.01% |
    | Offensive Pressure | 2,692 | 7.69% |
    | Hit and Run | 2,517 | 7.19% |
    | Defensive Zoning | 218 | 0.62% |
    | Zoning and Anti-Air | 57 | 0.16% |
    | Passive Defense / Defenisve Zoning | 2 | 0.01% |
  • Slice length per clip: first 101 frames (~5 s) at training time (--num_frames 101), aligned with SF2 Round 5.

  • Train cache (used by the fixedprompt run only): datasets/SF3/cache_480x832_5s_fixed/ β€” VAE latents and T5 embeddings precomputed offline. video/ and first_frame/ are symlinked from cache_480x832_5s/ (39,562 .pt each, prompt-independent and reusable). t5/ contains 2 entries: the SF3 fixed prompt and the empty string. Loading from cache drops VAE+T5 from GPU after init (~11 GB saved) and gives ~30–50 % per-step speedup vs the non-cached run.
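The 35k-CSV construction described above can be sketched as follows (the column name `strategy` and the helper are assumptions for illustration; the real script is examples/sf3/scripts/build_train_35k.py):

```python
import pandas as pd

def build_train_35k(df, target_rd=13610, seed=0):
    """Keep all non-Reactive-Defense rows, subsample RD, restore CSV order."""
    rd = df[df["strategy"] == "Reactive Defense"]
    rest = df[df["strategy"] != "Reactive Defense"]
    # matches the documented pandas.sample(random_state=0) + sort_index()
    rd_sub = rd.sample(n=target_rd, random_state=seed)
    return pd.concat([rest, rd_sub]).sort_index()
```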


Training method

All runs use the same launcher (accelerate launch --num_processes=8 --multi_gpu) on 8Γ— H200 (143 GB), with batch size 1/GPU and accumulation 1 (effective bs = 8). Optimizer is the project default; loss is FlowMatch SFT loss; gradient checkpointing on, gc_offload off (~99 GB/card; gc_offload on would drop to ~74 GB but cost 2–3Γ— slower steps).

Common hyper-parameters

| Item | Value |
|---|---|
| Resolution (H × W) | 480 × 832 (aspect 1.733, +1% vs source 1.714) |
| Frames | 101 (5 s @ ~20 fps) |
| Token budget | 51 (latent T) × 15 (H/32) × 26 (W/32) = 19,890 patched tokens |
| Learning rate | 5e-5 |
| Save interval | every 1,000 steps |
| enable_action | true |
| action_hold_window | 10 |
| action_dropout_prob | 0.0 |
| prompt_dropout_prob | 0.1 (CFG-friendly) |
| dataset_repeat | 1 |
| dataset_num_workers | 4 |
| Mixed precision | bf16 (Wan2.2 default) |
| Hardware | 8× NVIDIA H200 143 GB |
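The token budget follows directly from the resolution and the 32× spatial downsample (16× VAE × 2× patchify per axis), with 101 input frames mapping to 51 latent frames as stated above:

```python
# Reproducing the token-budget arithmetic from the table above.
latent_t, H, W = 51, 480, 832
tokens = latent_t * (H // 32) * (W // 32)   # 51 * 15 * 26
print(tokens)  # 19890
```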

Run-specific differences

1. p1_joint_480x832_5s_verbose/ (cold start)

  • Script: examples/sf2/model_training/action_5s/train_action.py
  • Launch: examples/sf3/model_training/action_5s/attempt_joint_5s_480x832.sh
  • Init: cold start from base Wan2.2-TI2V-5B (no --load_action, no --load_mg3, no --fuse_loRA). action_embedders randomly initialized (residual zero-init inside module).
  • Trainable: --trainable_models dit (full-parameter; includes action_embedders and cross-attn).
  • Prompt source: --use_csv_prompt --prompt_column prompt (per-clip verbose prompt from CSV).

2. p1_joint_480x832_5s_verbose/resume_step13000/ (warm restart)

  • Script: same train_action.py, but launched with --load_action .../p1_joint_480x832_5s_verbose/step-13000.safetensors.
  • Launch: examples/sf3/model_training/action_5s/attempt_joint_5s_480x832_resume13k.sh
  • Resume semantics: only DiT weights are restored from step-13000. Optimizer momentum, LR schedule, RNG, and dataloader position are not restored β€” the global step counter restarts from 0, so files in this folder named step-N correspond to global step 13000 + N of the conceptual run. Expect a small loss bump in the first ~500 steps (optimizer warm-up).
  • All other flags identical to the cold start above.
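Since the resume run's step counter restarts at zero, mapping a checkpoint filename back to its conceptual global step is a fixed offset (trivial helper for illustration only):

```python
def global_step(local_step, resumed_from=13000):
    """Map step-<N> in the resume folder to the conceptual global step."""
    return resumed_from + local_step

print(global_step(36000))  # 49000
```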

3. p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/ (cold + freeze cross-attn)

  • Script: examples/sf2/model_training/action_5s/train_action_cached.py (cached / precomputed-latents trainer).
  • Launch: examples/sf3/model_training/action_5s/attempt_joint_5s_480x832_cached_fixedprompt_coldstart_freeze_xattn.sh
  • Init: cold start from base Wan2.2-TI2V-5B; action_embedders randomly initialized.
  • Prompt source: fixed prompt "Street Fighter 3 arcade fighting game gameplay" applied to every CSV row (no --use_csv_prompt β€” verbose per-clip prompts in the CSV are ignored).
  • Trainable: --trainable_models dit --freeze_filter cross_attn. All 30 DiTBlock.cross_attn modules (q/k/v/o + norm3) have requires_grad=False. Updates flow through self_attn, ffn, norms, patch_embed, time/text/img embedders, output head, and action_embedders β€” roughly ~4.1B trainable (vs ~5.0B in the verbose runs).
  • Training data: metadata_train_10k_balanced.csv (10,002 rows, see above), via the cache cache_480x832_5s_fixed/.
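A name-based freeze filter like `--freeze_filter cross_attn` can be sketched as below (the actual trainer logic may differ; this simply flips `requires_grad` on every parameter whose qualified name contains the needle):

```python
import torch.nn as nn

def freeze_by_name(model: nn.Module, needle: str = "cross_attn"):
    """Freeze all parameters whose qualified name contains `needle`;
    returns the number of scalar parameters frozen."""
    frozen = 0
    for name, param in model.named_parameters():
        if needle in name:
            param.requires_grad = False
            frozen += param.numel()
    return frozen
```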

Available checkpoints (steps that are multiples of 2000)

  • p1_joint_480x832_5s_verbose/: 2k, 4k, 6k, 8k, 10k, 12k
  • p1_joint_480x832_5s_verbose/resume_step13000/: 2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k, 28k, 30k, 32k, 34k, 36k (corresponding to global steps 15k / 17k / ... / 49k of the conceptual run)
  • p1_joint_480x832_5s_fixedprompt_coldstart_freeze_xattn/: 2k, 4k, 6k, 8k, 10k, 12k, 14k, 16k, 18k, 20k, 22k, 24k, 26k

Intermediate (1k / 3k / 5k / ...) checkpoints exist in our local training output but are not uploaded here.


Loading a checkpoint

These weights require the custom WanModelAction code from the training repo (Training_Wan/diffsynth/models/our_small_dit.py and friends). Rough sketch:

```python
from diffsynth.pipelines.our_sf2_world import WanVideoPipeline
import torch, safetensors.torch

pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16, device="cuda",
    # ... point at your local Wan2.2-TI2V-5B + UMT5 weights ...
    enable_action=True,
)
state = safetensors.torch.load_file(
    "p1_joint_480x832_5s_verbose/resume_step13000/step-36000.safetensors"
)
# strict=False tolerates any key mismatch between the flat state-dict
# and the instantiated WanModelAction
pipe.dit.load_state_dict(state, strict=False)
```

For training continuation, pass the file path via --load_action /path/to/step-N.safetensors to the SF2/SF3 train_action*.py scripts.


Notes & caveats

  • These are research checkpoints from an in-progress training campaign β€” not a polished release. The verbose cold start and the resume_step13000 continuation form one conceptual training trajectory; the fixedprompt_coldstart_freeze_xattn run is an independent ablation isolating the contribution of cross-attn updates and per-clip prompts.
  • VAE / T5 weights are not bundled. Use the originals from Wan-AI/Wan2.2-TI2V-5B and Wan-AI/Wan2.1-T2V-1.3B (UMT5 tokenizer).
  • Action input expects the SF2/SF3 10-button schema in the order UP / DOWN / LEFT / RIGHT / Y / X / Z / A / B / C; mismatched ordering will silently produce garbage.
  • For the freeze_xattn run, prompts at inference time should match the training-time fixed prompt ("Street Fighter 3 arcade fighting game gameplay") β€” the cross-attn layers were frozen and never adapted to per-clip text.

Citation / acknowledgement

Built on top of Wan-AI/Wan2.2-TI2V-5B and a research fork of modelscope/DiffSynth-Studio.
