sc-5445: ship lean pre-quantized Q4 snapshot (Q4 DiT 32.8->8.9GB + quantization manifest; prune redundant raw pickles; refresh card)


	---
	license: mit
	pipeline_tag: image-to-video
	library_name: diffusers
	tags:
	- character-animation
	- video-generation
	- cross-identity-replacement
	- pose-driven
	- diffusion
	- mlx
	- apple-silicon
	base_model: zai-org/SCAIL-2
	---

	# SceneWorks/scail2-mlx

	Turnkey, SceneWorks-converted weights of [zai-org/SCAIL-2](https://huggingface.co/zai-org/SCAIL-2) — an end-to-end controlled character-animation / motion-transfer video model — packaged for native Apple-Silicon (MLX) inference inside SceneWorks. This is not an original model; it is a format/dtype repackaging of the upstream release for first-class macOS use (no PyTorch at runtime).

	> Capabilities (from upstream): character animation from a reference image + driving video, cross-identity character replacement, zero-shot animal-driving, end-to-end and pose-rendered driving, and (experimental) multi-reference. Still-image output is the special case `num_frames == 1`.

	## What changed vs. upstream
	Every component is repackaged to the safetensors layout the SceneWorks Rust/MLX loaders consume — no PyTorch at runtime:
	- DiT (`model/1/fsdp2_rank_0000_checkpoint.pt`, an FSDP2/SAT checkpoint) → `dit.safetensors`. Keys were remapped to the `SCAIL2Model` parameter naming following the upstream `convert.py` contract (fused `query_key_value`→`q`/`k`/`v`, `key_value`→`k`/`v`, `clip_feature_key_value_list`→`k_img`/`v_img`), the weights were cast fp32 → bf16, and the result was pre-quantized on disk to group-wise affine Q4 (group size 64). The attention (`q`/`k`/`v`/`o` plus the I2V `k_img`/`v_img`) and FFN (`ffn.0`/`ffn.2`) linear layers are stored packed as `weight` (u32 codes) + `scales` + `biases`, produced by MLX `quantize` and byte-equal to the output of `nn.quantize`; the patch/text/time/image embeddings, norms, and output head stay dense bf16. A `quantization` block in `config.json` marks the snapshot so the loader builds the quantized linear layers directly from the packs, without materializing dense bf16 weights at load. The key remap is bit-faithful: 987 source keys become 1307 model keys, with an exact key and shape match against `SCAIL2Model.from_config(config-14b.json)`.
	- VAE (`Wan2.1_VAE.pth`, the stock Wan2.1 z16 VAE) → `vae.safetensors` (f32, channels-last conv transpose, keys unchanged — the `sanitize_wan_vae_weights` contract shared with Bernini/wan). Loaded by `mlx_gen_wan::WanVae`.
	- Text encoder (`umt5-xxl/models_t5_umt5-xxl-enc-bf16.pth`, stock UMT5-XXL) → `t5_encoder.safetensors` (bf16, sole rename `.ffn.gate.0.`→`.ffn.gate_proj.`). Loaded by `mlx_gen_wan::Umt5Encoder` with `tokenizer.json`.
	- Image encoder (`models_clip_...onlyvisual.pth`, open-CLIP XLM-RoBERTa ViT-H/14) → `clip.safetensors` (f32, de-prefixed `visual.*` keys). Loaded by `mlx_gen_scail2::ScailClip` (32-layer visual tower, `use_31_block` penultimate features).
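
The fused-projection split named in the DiT bullet can be pictured with a minimal sketch. It assumes the fused tensor stacks its `q`, `k`, `v` blocks in that order along the output (row) axis, which this card does not state. The helper `split_fused` and the `blocks.0.*` key prefixes are hypothetical; the real conversion follows the upstream `convert.py`.

```python
def split_fused(prefix, fused_rows, parts):
    """Split a fused projection weight into equally sized named blocks.

    Assumption (not confirmed by this card): the fused tensor stacks its
    blocks along the output/row axis in the order given by `parts`.
    """
    n = len(fused_rows)
    if n % len(parts) != 0:
        raise ValueError(f"{n} rows do not split evenly into {len(parts)} parts")
    step = n // len(parts)
    return {
        f"{prefix}.{name}.weight": fused_rows[i * step:(i + 1) * step]
        for i, name in enumerate(parts)
    }


# Toy 6x2 "weight": rows 0-1 play q, rows 2-3 play k, rows 4-5 play v.
fused = [[float(r), float(r) + 0.5] for r in range(6)]

# Hypothetical key prefixes, used only for illustration.
remapped = split_fused("blocks.0.self_attn", fused, ["q", "k", "v"])
kv = split_fused("blocks.0.cross_attn", fused[:4], ["k", "v"])
```

One fused source key yields several model keys, which is consistent with the key count growing from 987 to 1307 during the remap.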

	The converted VAE/UMT5 are byte-size-identical (modulo safetensors header) to Bernini/wan's already-validated Wan2.1 VAE + umt5-xxl safetensors — confirming SCAIL-2 ships the stock components.
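
A size comparison "modulo safetensors header" can be sketched as follows. A safetensors file starts with an 8-byte little-endian length, followed by a JSON header of that length, followed by the raw tensor bytes. The helper names `split_safetensors` and `make_safetensors` are made up for this illustration, and the toy blobs stand in for real files; this is not the check SceneWorks ran.

```python
import json
import struct


def split_safetensors(blob: bytes):
    """Return (header_dict, payload_size) for a safetensors byte string."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    payload_size = len(blob) - 8 - header_len
    return header, payload_size


def make_safetensors(tensors: dict, metadata=None) -> bytes:
    """Build a tiny safetensors blob from {name: (dtype, shape, raw_bytes)}."""
    header, payload, offset = {}, b"", 0
    if metadata:
        header["__metadata__"] = metadata
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        payload += raw
        offset += len(raw)
    head = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + payload


# Two blobs with the same tensor bytes but different header metadata.
raw = bytes(range(16))
a = make_safetensors({"w": ("F32", [2, 2], raw)})
b = make_safetensors({"w": ("F32", [2, 2], raw)}, metadata={"format": "mlx"})
payload_a = split_safetensors(a)[1]
payload_b = split_safetensors(b)[1]
```

The two blobs differ in total length, but their payload sizes agree. For multi-gigabyte files one would read only the first `8 + header_len` bytes and take the total size from the filesystem rather than loading the whole file.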

	## Contents (turnkey MLX snapshot)
	| file | source | loader | notes |
	|---|---|---|---|
	| `dit.safetensors` | converted | `Scail2Dit` | SCAIL-2 14B DiT, Q4 packed (attn + FFN) + dense bf16 (embeds/norms/head) (~8.9 GB) |
	| `vae.safetensors` | converted | `WanVae` | Wan2.1 z16 VAE, f32, stride (4,8,8) (~0.5 GB) |
	| `t5_encoder.safetensors` | converted | `Umt5Encoder` | UMT5-XXL encoder, bf16 (~11 GB) |
	| `clip.safetensors` | converted | `ScailClip` | open-CLIP ViT-H/14 visual tower, f32, 1280-dim (~2.5 GB) |
	| `tokenizer.json` | upstream, stock | `load_tokenizer` | UMT5-XXL HF tokenizer (root copy) |
	| `config.json` | upstream `configs/config-14b.json` + `quantization` block | `Scail2Config` | `model_type: i2v`, `dim 5120`, `ffn 13824`, `40` layers/heads, `in_dim 20`, `mask_dim 28`, `out_dim 16`; `quantization: {bits 4, group_size 64}` |
	| `bias-aware-dpo-lora.pt` | upstream, stock | `mlx_gen_scail2` (sc-5451) | optional Bias-Aware DPO refinement LoRA |

	The DiT ships pre-quantized to Q4 on disk (the SceneWorks worker default), so the loader reads the packs directly — there is no dense-bf16 load transient. The VAE / UMT5 / CLIP ship dense (f32 / bf16). This repo ships only the loadable safetensors + tokenizer + the optional DPO LoRA; the redundant raw upstream pickles (`Wan2.1_VAE.pth`, `umt5-xxl/models_t5_...pth`, `models_clip_...onlyvisual.pth`) have been pruned — they are reproducible from the upstream release and the Rust loaders never used them.
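
To make "group-wise affine Q4" concrete, here is a pure-Python sketch of quantizing one group of 64 weights to 4-bit codes with a per-group scale and bias, then packing eight codes per 32-bit word. It follows the usual affine scheme (`w ≈ scale * code + bias`) and assumes lowest-nibble-first packing. It is an illustration only and makes no claim to be byte-equal to MLX `quantize`.

```python
GROUP_SIZE = 64
BITS = 4
LEVELS = (1 << BITS) - 1  # 15 steps between the group min and max


def quantize_group(values):
    """Affine-quantize one group: w ~= scale * code + bias, code in 0..15."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # flat group: any nonzero scale works
    codes = [min(LEVELS, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def pack_u32(codes):
    """Pack eight 4-bit codes per 32-bit word, lowest nibble first (assumed)."""
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, c in enumerate(codes[i:i + 8]):
            word |= c << (BITS * j)
        words.append(word)
    return words


def unpack_u32(words, count):
    """Inverse of pack_u32: recover the first `count` 4-bit codes."""
    codes = [(w >> (BITS * j)) & LEVELS for w in words for j in range(8)]
    return codes[:count]


def dequantize_group(words, scale, bias, count=GROUP_SIZE):
    """Rebuild approximate weights from packed codes plus scale and bias."""
    return [scale * c + bias for c in unpack_u32(words, count)]


# One toy group of 64 weights spanning [-1.0, 1.0].
group = [-1.0 + 2.0 * i / (GROUP_SIZE - 1) for i in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
words = pack_u32(codes)
restored = dequantize_group(words, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Each group of 64 weights becomes eight u32 words plus one scale and one bias, and the round-trip error stays within half a quantization step. A loader that reads such packs directly never needs the dense weights in memory.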

	## Architecture (summary)
	Wan2.1-14B I2V dense DiT. Conditioning is a token-axis packed stream — reference + video + pose patch-embedded (three Conv3d stems) with additive 28-channel color-coded mask embeddings, concatenated into one self-attention sequence — plus a per-source RoPE with integer T/H/W shifts (the `replace_flag` flips the reference H-shift, toggling animation vs. replacement). The reference image is encoded by the CLIP visual tower and injected via Wan-I2V image cross-attention. Sampling is plain CFG (guide 5.0), flow-matching UniPC/DPM++.
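
For readers unfamiliar with classifier-free guidance, the standard combination step looks like the sketch below. The function name `cfg_combine` is hypothetical, and whether the SceneWorks sampler uses exactly this form is an assumption; only the guidance value 5.0 comes from this card.

```python
def cfg_combine(cond, uncond, guide=5.0):
    """Standard CFG: move the prediction away from the unconditional one."""
    return [u + guide * (c - u) for c, u in zip(cond, uncond)]


# Toy per-element model outputs for one denoising step.
cond = [1.0, 0.0, -2.0]
uncond = [0.5, 0.0, -1.0]
guided = cfg_combine(cond, uncond)
```

With `guide=1.0` the result reduces to the conditional prediction; larger values amplify the difference between the conditional and unconditional outputs.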

	## Runtime (Apple Silicon)
	The production default — 832×480 / 5 s (one 81-frame driving segment) — runs the DiT in f32 compute (bf16 overflows to NaN at that packed-sequence length), with shared FFN/attention activation chunking and a temporal-tiled VAE decode, at a measured process footprint of ~70–76 GB. SceneWorks gates SCAIL-2 to 96 GB-class Macs. The Q4 DiT keeps the resident weights and the snapshot download lean (≈ 24 GB total).
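
The download size can be sanity-checked with simple arithmetic over the approximate sizes in the Contents table. The sketch also estimates the storage cost per quantized weight, assuming each group of 64 carries one 16-bit scale and one 16-bit bias; that width is an assumption, not something this card states.

```python
# Approximate sizes in GB, copied from the Contents table.
SIZES_GB = {
    "dit.safetensors": 8.9,
    "vae.safetensors": 0.5,
    "t5_encoder.safetensors": 11.0,
    "clip.safetensors": 2.5,
}


def effective_bits(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Bits stored per quantized weight, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


weights_total_gb = sum(SIZES_GB.values())
bits_per_weight = effective_bits()
```

The four weight files sum to roughly 22.9 GB, in line with the ≈ 24 GB snapshot total once the optional LoRA and small files are included. Under the stated assumption, a packed weight costs 4.5 bits rather than 4.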

	## License & attribution
	This repackaging redistributes upstream weights under the license declared on the upstream model card (MIT); the upstream code repository is Apache-2.0. Please consult and cite the original:
	- Model: https://huggingface.co/zai-org/SCAIL-2
	- Code: https://github.com/zai-org/SCAIL-2
	- Paper: SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning (arXiv:2606.10804)
	- Built on Wan2.1 (Alibaba Wan team), UMT5-XXL, and OpenCLIP.

	All credit for the model belongs to the original authors. This repo exists solely to make SCAIL-2 usable in SceneWorks on Apple Silicon.