---
license: mit
pipeline_tag: image-to-video
library_name: diffusers
tags:
- character-animation
- video-generation
- cross-identity-replacement
- pose-driven
- diffusion
- mlx
- apple-silicon
base_model: zai-org/SCAIL-2
---

# SceneWorks/scail2-mlx

**Turnkey, SceneWorks-converted weights of [zai-org/SCAIL-2](https://huggingface.co/zai-org/SCAIL-2)**, an end-to-end controlled **character-animation / motion-transfer** video model, packaged for **native Apple-Silicon (MLX)** inference inside SceneWorks. This is **not** an original model; it is a format/dtype repackaging of the upstream release for first-class macOS use (no PyTorch at runtime).

> Capabilities (from upstream): character animation from a reference image + driving video, **cross-identity character replacement**, zero-shot animal-driving, end-to-end *and* pose-rendered driving, and (experimental) multi-reference. Image output is `num_frames == 1`.

## What changed vs. upstream

Every component is repackaged to the safetensors layout that the SceneWorks Rust/MLX loaders consume, so no PyTorch is needed at runtime:

- **DiT** (`model/1/fsdp2_rank_0000_checkpoint.pt`, an FSDP2/SAT checkpoint) was key-remapped to the `SCAIL2Model` parameter naming using the upstream `convert.py` contract (fused `query_key_value` → `q`/`k`/`v`, `key_value` → `k`/`v`, `clip_feature_key_value_list` → `k_img`/`v_img`), cast **fp32 → bf16**, then **pre-quantized to group-wise-affine Q4** on disk → `dit.safetensors`. The attention (`q`/`k`/`v`/`o` + I2V `k_img`/`v_img`) and FFN (`ffn.0`/`ffn.2`) Linears are packed (`weight` u32 codes + `scales` + `biases` via MLX `quantize`, byte-equal to `nn.quantize`, group size 64); the patch/text/time/image embeddings, norms, and output head stay dense bf16. A `quantization` block in `config.json` marks the snapshot so that the loader builds the quantized Linears directly from the packs (no dense bf16 is materialized at load). The key remap is bit-faithful: 987 source keys → 1307 model keys, an exact key and shape match against `SCAIL2Model.from_config(config-14b.json)`.
- **VAE** (`Wan2.1_VAE.pth`, the stock Wan2.1 z16 VAE) → `vae.safetensors` (**f32**, channels-last conv transpose, keys unchanged; this is the `sanitize_wan_vae_weights` contract shared with Bernini/wan). Loaded by `mlx_gen_wan::WanVae`.
- **Text encoder** (`umt5-xxl/models_t5_umt5-xxl-enc-bf16.pth`, stock UMT5-XXL) → `t5_encoder.safetensors` (**bf16**, sole rename `.ffn.gate.0.` → `.ffn.gate_proj.`). Loaded by `mlx_gen_wan::Umt5Encoder` with `tokenizer.json`.
- **Image encoder** (`models_clip_...onlyvisual.pth`, open-CLIP XLM-RoBERTa ViT-H/14) → `clip.safetensors` (**f32**, de-prefixed `visual.*` keys). Loaded by `mlx_gen_scail2::ScailClip` (32-layer visual tower, `use_31_block` penultimate features).

The converted VAE and UMT5 files are byte-size-identical (modulo the safetensors header) to Bernini/wan's already-validated Wan2.1 VAE and umt5-xxl safetensors, which confirms that SCAIL-2 ships the stock components.
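
To make the "group-wise-affine Q4" step above concrete, here is a minimal pure-Python sketch of affine quantization for one group of 64 weights, where each group stores 4-bit codes plus one scale and one bias and is reconstructed as `scale * code + bias`. It is an illustration only, not the SceneWorks converter or MLX's implementation: the helper names are hypothetical, and the nibble order inside each 32-bit word and the 16-bit size of `scales`/`biases` are assumptions.

```python
GROUP_SIZE = 64            # group size stated in the card's quantization block
BITS = 4                   # Q4: each code is an integer in 0..15
LEVELS = (1 << BITS) - 1   # 15, the largest code


def quantize_group(weights):
    """Affine-quantize one group so that w ~= scale * code + bias."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS
    if scale == 0.0:       # constant group: any non-zero scale works
        scale = 1.0
    codes = [min(LEVELS, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def pack_u32(codes):
    """Pack eight 4-bit codes per 32-bit word, lowest nibble first (assumed order)."""
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            word |= code << (BITS * j)
        words.append(word)
    return words


def unpack_u32(words, count):
    """Inverse of pack_u32: recover the first `count` 4-bit codes."""
    codes = [(word >> (BITS * j)) & LEVELS for word in words for j in range(8)]
    return codes[:count]


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes, scale, and bias."""
    return [scale * code + bias for code in codes]


def bits_per_weight(group_size=GROUP_SIZE, bits=BITS, meta_bits=16):
    """Storage cost per weight, assuming 16-bit scale and bias per group."""
    return (group_size * bits + 2 * meta_bits) / group_size
```

Under these assumptions a group of 64 weights occupies eight `u32` words plus two 16-bit values, i.e. `bits_per_weight()` evaluates to 4.5 bits per quantized weight, and the round-trip error of each weight is at most half a quantization step.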

## Contents (turnkey MLX snapshot)

| file | source | loader | notes |
|---|---|---|---|
| `dit.safetensors` | converted | `Scail2Dit` | SCAIL-2 14B DiT, **Q4 packed** (attn + FFN) + dense bf16 (embeds/norms/head), ~8.9 GB |
| `vae.safetensors` | converted | `WanVae` | Wan2.1 z16 VAE, **f32**, stride (4,8,8) (~0.5 GB) |
| `t5_encoder.safetensors` | converted | `Umt5Encoder` | UMT5-XXL encoder, **bf16** (~11 GB) |
| `clip.safetensors` | converted | `ScailClip` | open-CLIP ViT-H/14 visual tower, **f32**, 1280-dim (~2.5 GB) |
| `tokenizer.json` | upstream, stock | `load_tokenizer` | UMT5-XXL HF tokenizer (root copy) |
| `config.json` | upstream `configs/config-14b.json` + `quantization` block | `Scail2Config` | `model_type: i2v`, `dim 5120`, `ffn 13824`, `40` layers/heads, `in_dim 20`, `mask_dim 28`, `out_dim 16`; `quantization: {bits 4, group_size 64}` |
| `bias-aware-dpo-lora.pt` | upstream, stock | `mlx_gen_scail2` (sc-5451) | optional Bias-Aware DPO refinement LoRA |

The DiT ships **pre-quantized to Q4 on disk** (the SceneWorks worker default), so the loader reads the packs directly and there is no dense-bf16 load transient. The VAE, UMT5, and CLIP ship dense (f32 / bf16). This repo ships **only** the loadable safetensors, the tokenizer, and the optional DPO LoRA; the redundant raw upstream pickles (`Wan2.1_VAE.pth`, `umt5-xxl/models_t5_...pth`, `models_clip_...onlyvisual.pth`) have been **pruned**, since they are reproducible from the upstream release and the Rust loaders never used them.
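
One way to check a downloaded snapshot against the table above without loading any weights is to read only the safetensors header, which is an 8-byte little-endian length followed by a JSON index of tensor names, dtypes, shapes, and byte offsets. The sketch below uses only the Python standard library; the function names and the local path in the usage note are hypothetical, and the exact tensor key names inside `dit.safetensors` are not stated in this card.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the tensor index of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # u64, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)                      # optional free-form metadata
    return header


def summarize(header):
    """Return (tensor_count, payload_bytes, {dtype: tensor_count})."""
    payload = 0
    dtypes = {}
    for info in header.values():
        begin, end = info["data_offsets"]
        payload += end - begin
        dtypes[info["dtype"]] = dtypes.get(info["dtype"], 0) + 1
    return len(header), payload, dtypes
```

For example, `summarize(read_safetensors_header("scail2-mlx/dit.safetensors"))` (a hypothetical local path) should report `U32` entries for the packed Linear codes alongside the dense `BF16` tensors if the file matches the description above. Comparing `payload_bytes` between two files is also a quick way to test a "byte-size-identical modulo header" claim.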

## Architecture (summary)

Wan2.1-14B **I2V** dense DiT. Conditioning is a **token-axis packed** stream: reference, video, and pose are patch-embedded (three Conv3d stems) with additive 28-channel color-coded mask embeddings and concatenated into one self-attention sequence. The stream uses a **per-source RoPE** with integer T/H/W shifts (the `replace_flag` flips the reference H-shift, toggling animation vs. replacement). The reference image is encoded by the CLIP visual tower and injected via Wan-I2V image cross-attention. Sampling uses plain CFG (guidance scale 5.0) with flow-matching UniPC/DPM++ solvers.
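
As a reminder of what "plain CFG" means at each sampling step, the sketch below combines a conditional and an unconditional model output with the guidance scale of 5.0 quoted above. The card does not spell out the formula, so this is the standard classifier-free guidance form, assumed here rather than taken from the SCAIL-2 code; `cfg_combine` is a hypothetical helper operating on flat lists of floats.

```python
def cfg_combine(cond, uncond, guidance_scale=5.0):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
```

With a scale of 1.0 the result equals the conditional prediction; larger scales push the output further along the direction from the unconditional to the conditional prediction, at the cost of two model evaluations per step.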

## Runtime (Apple Silicon)

The production default, **832×480 / 5 s** (one 81-frame driving segment), runs the DiT in **f32 compute** (bf16 overflows to NaN at that packed-sequence length), with shared FFN/attention activation chunking and a temporal-tiled VAE decode, at a measured process footprint of **~70–76 GB**. SceneWorks gates SCAIL-2 to **96 GB**-class Macs. The Q4 DiT keeps the resident weights and the snapshot download lean (≈ 24 GB total).
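
To give a feel for the sequence lengths involved at the production default, the following sketch derives the latent grid and the per-source token count from the VAE stride (4,8,8) listed in the contents table. Two things are assumptions not stated in this card: that the temporal axis compresses as `(frames - 1) // 4 + 1` (the first frame is kept on its own) and that the DiT patch size is (1,2,2), as in stock Wan2.1. How many tokens the pose and reference streams actually contribute is also not specified here, so no total is claimed.

```python
def latent_grid(num_frames, height, width, stride=(4, 8, 8)):
    """Latent (T, H, W) for a clip, assuming the first frame is encoded alone."""
    st, sh, sw = stride
    return (num_frames - 1) // st + 1, height // sh, width // sw


def token_count(grid, patch=(1, 2, 2)):
    """Tokens produced by patch-embedding one latent grid (patch size assumed)."""
    t, h, w = grid
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


video_grid = latent_grid(81, 480, 832)    # (21, 60, 104)
video_tokens = token_count(video_grid)    # 21 * 30 * 52 = 32760
image_tokens = token_count(latent_grid(1, 480, 832))  # 1 * 30 * 52 = 1560
```

Under these assumptions a single 81-frame 832×480 source already yields 32,760 tokens, and packing several sources onto the token axis multiplies that, which is consistent with the activation chunking and the large memory footprint described above.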

## License & attribution

This repackaging redistributes upstream weights under the license declared on the upstream model card (**MIT**); the upstream code repository is Apache-2.0. Please consult and cite the original:

- Model: https://huggingface.co/zai-org/SCAIL-2
- Code: https://github.com/zai-org/SCAIL-2
- Paper: *SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning* (arXiv:2606.10804)
- Built on Wan2.1 (Alibaba Wan team), UMT5-XXL, and OpenCLIP.

All credit for the model belongs to the original authors. This repo exists solely to make SCAIL-2 usable in SceneWorks on Apple Silicon.