---
license: mit
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- apple-silicon
- video-generation
- text-to-video
- image-to-video
- video-continuation
- longcat
- flow-matching
- block-sparse-attention
base_model:
- meituan-longcat/LongCat-Video
language:
- en
- zh
---

Part of the [LongCat-Video — MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx) collection.

# LongCat-Video-bf16 (MLX)

Apple MLX bf16 weights for [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video), Meituan's 13.6 B-parameter base text/image-to-video diffusion model, with the **`cfg_step_lora` and `refinement_lora` published as separate files** for runtime task switching. The same DiT checkpoint serves all six task variants:

| Variant | Pipeline | LoRAs used |
|---|---|---|
| **T2V** (text-to-video) | `pipeline_t2v` | none (baseline) or `cfg_step_lora` (fast) |
| **I2V** (image-to-video) | `pipeline_i2v` | same |
| **Video Continuation** | `pipeline_continuation` | same |
| **720p / 30fps refinement** | `refinement.py` | `refinement_lora` + Block Sparse Attention |
| **Long-Video** | (chained Continuation) | same as Continuation |
| **Interactive Video** | (per-segment T2V/Continuation) | same |

For the companion audio-driven Avatar 1.5 port (built from the same DiT architecture + audio cross-attention overlay), see [mlx-community/LongCat-Video-Avatar-1.5-bf16](https://huggingface.co/mlx-community/LongCat-Video-Avatar-1.5-bf16).
## TL;DR

| | |
|---|---|
| **Architecture** | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
| **Params** | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
| **Format** | bf16, sharded safetensors (HF-style per-component subdirs) |
| **Disk** | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
| **Hardware** | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| **Inference** | 50-step baseline OR ~8-step with `cfg_step_lora` (fast); refinement adds 720p/30fps SDEdit pass |
| **License** | MIT (matches upstream Meituan) |

## Quick start

```bash
# 1. Pull weights (~42 GB)
hf download mlx-community/LongCat-Video-bf16 \
  --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video at 480p / 15fps
.venv/bin/python scripts/run_t2v.py \
  --weights ./weights/.. \
  --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
  --num-frames 93 \
  --out output_t2v.mp4

# 4. (Optional) Refinement pass to 720p / 30fps
.venv/bin/python scripts/run_refine.py \
  --weights ./weights/.. \
  --stage1 output_t2v.npy \
  --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
  --out output_refined.mp4
```

## Six task variants from one DiT

All six pipelines share the same 13.6 B DiT weights.
The **conditioning input** and **LoRA stack** are what change:

| Variant | Conditioning latent | LoRA stack | BSA |
|---|---|---|---|
| T2V | pure noise | (optional `cfg_step_lora`) | off |
| I2V | 1 reference frame at head | (optional `cfg_step_lora`) | off |
| Continuation | last N frames of prior clip | (optional `cfg_step_lora`) | off |
| Refinement | partial-noise on VAE-encoded upsample of coarse output | `refinement_lora` | **on** |
| Long-Video | chained Continuation segments | inherits | off |
| Interactive | sequenced T2V/Continuation w/ per-segment prompts | inherits | off |

## Architecture

This is the **base text-to-video** port. It differs from the Avatar overlay in the companion repo as follows:

- **No audio path**: no Whisper-Large-v3 encoder, no AudioProjModel, no audio cross-attention in DiT blocks
- **No Reference Skip Attention**: base I2V uses the reference frame as a *motion anchor*, not a persistent identity, so the Avatar-specific Q-slicing is not used here
- **Standard text-CFG** (2-pass), vs Avatar's 3-pass disentangled CFG
- **`scheduler_shift = 12.0`**, vs Avatar's 7.0
- **Block Sparse Attention**: needed only by the 720p refinement pass (`enable_bsa: false` in the base DiT config; the refinement script flips it on along with hot-swapping `refinement_lora`)

### Block Sparse Attention details

BSA params from the published config:

```json
"bsa_params": {
  "sparsity": 0.9375,
  "chunk_3d_shape_q": [4, 4, 4],
  "chunk_3d_shape_k": [4, 4, 4]
}
```

Tokens are grouped into 4×4×4 = 64-token blocks along the patchified (T_lat, H_lat, W_lat) grid. A sparsity of 0.9375 keeps 6.25% of K/V blocks per Q block, selected by top-k routing on block-level mean-pooled scores. This makes 720p attention tractable; without it the 720p second pass would be too expensive on Apple Silicon. (The Tier A pure-MLX implementation in this port is correct but not yet kernel-fast; a Tier B Metal kernel is in progress.)
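The block bookkeeping and top-k routing described above can be sketched in plain Python. This is an illustrative sketch only, not the port's implementation: the function names are hypothetical, ceil-division for grids that are not multiples of the chunk shape is an assumption, and scoring blocks by the dot product of mean-pooled Q and K vectors is one plausible reading of "block-level mean-pooled scores".

```python
import math


def bsa_block_budget(grid, chunk=(4, 4, 4), sparsity=0.9375):
    """Return (n_blocks, tokens_per_block, kv_blocks_kept_per_q_block).

    `grid` is the patchified (T_lat, H_lat, W_lat) shape. Ceil-division
    (i.e. padding partial blocks) is an assumption of this sketch.
    """
    n_blocks = math.prod(math.ceil(g / c) for g, c in zip(grid, chunk))
    tokens_per_block = math.prod(chunk)
    kept = max(1, round(n_blocks * (1.0 - sparsity)))
    return n_blocks, tokens_per_block, kept


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def block_scores(q_blocks, k_blocks):
    """Score every (Q block, K block) pair by the dot product of block means."""
    q_means = [mean_pool(block) for block in q_blocks]
    k_means = [mean_pool(block) for block in k_blocks]
    return [
        [sum(a * b for a, b in zip(q_mean, k_mean)) for k_mean in k_means]
        for q_mean in q_means
    ]


def topk_kv_blocks(scores, k):
    """For each Q block, return the sorted indices of its k best K/V blocks."""
    selected = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        selected.append(sorted(order[:k]))
    return selected
```

For a hypothetical 8×16×16 latent grid, `bsa_block_budget((8, 16, 16))` gives 32 blocks of 64 tokens each, with 2 K/V blocks kept per Q block; full attention is then computed only inside the selected block pairs.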
## Programmatic LoRA merge

Each LoRA can be loaded separately for fine-grained control:

```python
from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
from longcat_video.lora import compute_merged_delta, group_lora_tensors
from safetensors import safe_open
import mlx.core as mx

pipeline = LongCatVideoT2VPipeline(...)  # standard 3-component load

# Load the cfg_step_lora tensors for the fast path (8 steps, no CFG correction)
lora_sd = {}
with safe_open("weights/lora/cfg_step_lora.safetensors", framework="numpy") as f:
    for k in f.keys():
        lora_sd[k] = mx.array(f.get_tensor(k))

# The LoRA merge helpers imported above cover both cfg_step_lora and
# refinement_lora; load whichever file your variant uses.
```

## License

MIT, matching the upstream [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video) license. Use of the model implies compliance with the upstream responsible-use guidelines (no generation of harmful, defamatory, or non-consensual content).

## Acknowledgements

- [Meituan LongCat team](https://github.com/meituan-longcat): original PyTorch model and tech report
- [ml-explore/mlx](https://github.com/ml-explore/mlx): the framework
- [mlx-community](https://huggingface.co/mlx-community): collection home