---
license: mit
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- apple-silicon
- video-generation
- text-to-video
- image-to-video
- video-continuation
- longcat
- flow-matching
- block-sparse-attention
base_model:
- meituan-longcat/LongCat-Video
language:
- en
- zh
---

Part of the [LongCat-Video - MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx) collection.

# LongCat-Video-bf16 (MLX)

Apple MLX bf16 weights for [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video),
Meituan's 13.6 B-parameter base text/image-to-video diffusion model. The
**`cfg_step_lora` and `refinement_lora` are published as separate files** for
runtime task switching.

The same DiT checkpoint serves all six task variants:

| Variant | Pipeline | LoRAs used |
|---|---|---|
| **T2V** (text-to-video) | `pipeline_t2v` | none (baseline) or `cfg_step_lora` (fast) |
| **I2V** (image-to-video) | `pipeline_i2v` | same as T2V |
| **Video Continuation** | `pipeline_continuation` | same as T2V |
| **720p / 30fps refinement** | `refinement.py` | `refinement_lora` + Block Sparse Attention |
| **Long-Video** | (chained Continuation) | same as Continuation |
| **Interactive Video** | (per-segment T2V/Continuation) | same as the underlying T2V/Continuation pipeline |

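
As a rough illustration of the table above, the variant-to-LoRA mapping can be written as a small lookup. `PIPELINES`, `ALIASES`, and `loras_for` are hypothetical names for this sketch and are not part of the `longcat_video` package; only the pipeline and LoRA names come from the table. Interactive Video is left out because it picks T2V or Continuation per segment.

```python
# Hypothetical lookup mirroring the variant table; not an API of the repo.
PIPELINES = {
    "t2v": "pipeline_t2v",
    "i2v": "pipeline_i2v",
    "continuation": "pipeline_continuation",
    "refinement": "refinement.py",
}

# Long-Video is described as chained Continuation, so it reuses that entry.
ALIASES = {"long_video": "continuation"}


def loras_for(variant, fast=False):
    """Return the LoRA names the table lists for a variant."""
    variant = ALIASES.get(variant, variant)
    if variant not in PIPELINES:
        raise KeyError(f"unknown variant: {variant}")
    if variant == "refinement":
        # Refinement always uses its own LoRA (plus Block Sparse Attention).
        return ["refinement_lora"]
    # T2V, I2V and Continuation: no LoRA for the baseline, cfg_step_lora for the fast path.
    return ["cfg_step_lora"] if fast else []


for name in ("t2v", "i2v", "long_video", "refinement"):
    print(name, loras_for(name, fast=True))
```
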
For the companion audio-driven Avatar 1.5 port (built on the same DiT
architecture plus an audio cross-attention overlay), see
[mlx-community/LongCat-Video-Avatar-1.5-bf16](https://huggingface.co/mlx-community/LongCat-Video-Avatar-1.5-bf16).

## TL;DR

| | |
|---|---|
| **Architecture** | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
| **Params** | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
| **Format** | bf16, sharded safetensors (HF-style per-component subdirs) |
| **Disk** | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
| **Hardware** | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| **Inference** | 50-step baseline or ~8-step with `cfg_step_lora` (fast); refinement adds a 720p/30fps SDEdit pass |
| **License** | MIT (matches upstream Meituan) |

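
The disk figures in the table can be sanity-checked with a few lines of arithmetic. This is only a sketch: the component sizes are copied from the table, and the 2-bytes-per-parameter rule is the generic bf16 estimate, not a measurement of the actual shards. Whether the card counts decimal GB or binary GiB is not stated, so both are shown.

```python
# Component sizes as listed in the TL;DR table (in GB).
DISK_GB = {"DiT": 26.0, "umT5": 11.0, "LoRAs": 5.3, "VAE": 0.242}
total_gb = sum(DISK_GB.values())
print(f"sum of listed components: {total_gb:.1f} GB")


def bf16_bytes(num_params):
    """bf16 stores each parameter in 2 bytes."""
    return num_params * 2


# Generic estimate for the 13.6 B-parameter DiT.
dit_gb = bf16_bytes(13.6e9) / 1e9      # decimal gigabytes
dit_gib = bf16_bytes(13.6e9) / 2**30   # binary gibibytes
print(f"DiT estimate: {dit_gb:.1f} GB / {dit_gib:.1f} GiB")
```

The listed 26 GB for the DiT sits between the two estimates, which is about what one would expect from rounded figures.
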
## Quick start

```bash
# 1. Pull weights (~42 GB)
hf download mlx-community/LongCat-Video-bf16 \
  --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video at 480p / 15fps
.venv/bin/python scripts/run_t2v.py \
  --weights ./weights/.. \
  --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
  --num-frames 93 \
  --out output_t2v.mp4

# 4. (Optional) Refinement pass to 720p / 30fps
.venv/bin/python scripts/run_refine.py \
  --weights ./weights/.. \
  --stage1 output_t2v.npy \
  --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
  --out output_refined.mp4
```

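
To render several prompts in a row, the step 3 command can be assembled from Python. The sketch below only builds and prints the command lines using the flags shown above; `build_t2v_command`, the second prompt, and the output file names are made up for illustration, and the `subprocess` call is left commented out so nothing runs by accident.

```python
import shlex
import subprocess  # only needed if the run line below is uncommented


def build_t2v_command(prompt, out, weights="./weights/..", num_frames=93,
                      python=".venv/bin/python"):
    """Assemble the argv list for scripts/run_t2v.py with the flags from the quick start."""
    return [
        python, "scripts/run_t2v.py",
        "--weights", weights,
        "--prompt", prompt,
        "--num-frames", str(num_frames),
        "--out", out,
    ]


prompts = [
    "A cat surfing on a wave at sunset, cinematic, 8k",
    "A paper boat drifting down a rainy street",  # hypothetical extra prompt
]
commands = [build_t2v_command(p, f"output_{i:02d}.mp4") for i, p in enumerate(prompts)]

for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted form, handy for logging
    # subprocess.run(cmd, check=True)  # uncomment to actually launch generation
```
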
## Six task variants from one DiT

All six pipelines share the same 13.6 B DiT weights. The **conditioning input**
and **LoRA stack** are what change:

| Variant | Conditioning latent | LoRA stack | BSA |
|---|---|---|---|
| T2V | pure noise | (optional `cfg_step_lora`) | off |
| I2V | 1 reference frame at head | (optional `cfg_step_lora`) | off |
| Continuation | last N frames of prior clip | (optional `cfg_step_lora`) | off |
| Refinement | partial-noise on VAE-encoded upsample of coarse output | `refinement_lora` | **on** |
| Long-Video | chained Continuation segments | inherits | off |
| Interactive | sequenced T2V/Continuation w/ per-segment prompts | inherits | off |

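
The "partial-noise" conditioning in the Refinement row is the SDEdit idea: instead of starting from pure noise, the upsampled coarse latent is pushed part of the way toward noise and only the remaining steps are denoised. A minimal sketch, assuming the usual flow-matching interpolation `x_t = (1 - t) * x0 + t * eps` with `t = 1` meaning pure noise; the starting level `t_start`, the step count, and the helper names are illustrative and are not taken from the repo.

```python
import random


def partial_noise(latent, t_start, rng):
    """Blend a clean latent toward Gaussian noise: x_t = (1 - t) * x0 + t * eps."""
    return [(1.0 - t_start) * x + t_start * rng.gauss(0.0, 1.0) for x in latent]


def remaining_steps(num_steps, t_start):
    """Steps left on a uniform grid when denoising starts at t_start instead of 1.0."""
    return max(1, round(num_steps * t_start))


rng = random.Random(0)
coarse_latent = [0.2, -1.0, 0.7, 0.0]  # stand-in for a VAE-encoded, upsampled frame
noisy = partial_noise(coarse_latent, t_start=0.5, rng=rng)
print(noisy)
print(remaining_steps(50, 0.5), "of 50 steps would still be run")
```

With `t_start = 0` the latent is returned unchanged, and with `t_start = 1` the result is pure noise, which is the T2V starting point in the table.
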
## Architecture

This is the **base text-to-video** port. It differs from the companion Avatar
repo, which adds an audio overlay, in the following ways:

- **No audio path**: no Whisper-Large-v3 encoder, no AudioProjModel, no
  audio cross-attention in DiT blocks
- **No Reference Skip Attention**: base I2V uses the reference frame as a
  *motion anchor*, not a persistent identity, so the Avatar-specific Q-slicing
  is not used here
- **Standard text-CFG** (2-pass), vs Avatar's 3-pass disentangled CFG
- **`scheduler_shift = 12.0`**, vs Avatar's 7.0
- **Block Sparse Attention**: needed only by the 720p refinement pass
  (`enable_bsa: false` in the base DiT config; the refinement script flips
  it on along with hot-swapping `refinement_lora`)

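
Two of the items above are easy to illustrate numerically. The sketch below assumes the shift is applied the way common flow-matching schedulers apply it (for example Diffusers' `FlowMatchEulerDiscreteScheduler`), `sigma' = shift * sigma / (1 + (shift - 1) * sigma)`, and that 2-pass CFG is the standard `uncond + scale * (cond - uncond)` combination. Whether this port uses exactly these forms is an assumption, and the function names are illustrative.

```python
def shift_sigma(sigma, shift):
    """Warp a noise level in [0, 1]; larger shift keeps more steps at high noise."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps, shift):
    """Uniform sigmas from 1.0 downward (excluding 0.0), then shifted."""
    return [shift_sigma(1.0 - i / num_steps, shift) for i in range(num_steps)]


def cfg_combine(cond, uncond, scale):
    """Standard 2-pass classifier-free guidance on two model outputs."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Compare the base model's shift (12.0) with the Avatar value (7.0).
for shift in (12.0, 7.0):
    schedule = shifted_schedule(8, shift)
    print(shift, [round(s, 3) for s in schedule])

print(cfg_combine(cond=[2.0, 0.5], uncond=[1.0, 0.5], scale=4.0))
```

In this form a guidance scale of 1.0 returns the conditional prediction unchanged, so the second (unconditional) pass contributes nothing.
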
### Block Sparse Attention details

BSA params from the published config:

```json
"bsa_params": {
  "sparsity": 0.9375,
  "chunk_3d_shape_q": [4, 4, 4],
  "chunk_3d_shape_k": [4, 4, 4]
}
```

Tokens are grouped into 4×4×4 = 64-token blocks along the patchified
(T_lat, H_lat, W_lat) grid. A sparsity of 0.9375 keeps 6.25% of K/V blocks per
Q block, selected by top-k routing on block-level mean-pooled scores. This makes
720p attention tractable; without it the 720p second pass would be too
expensive on Apple Silicon. (The Tier A pure-MLX implementation in this port is
numerically correct but not yet kernel-fast; a Tier B Metal kernel is in progress.)

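
The block arithmetic and the top-k routing described above can be sketched in plain Python. This is a toy illustration, not the port's kernel: the latent grid size and the two-dimensional "mean-pooled" vectors are made up, and the real implementation works on batched tensors.

```python
import math


def block_counts(grid, chunk=(4, 4, 4), sparsity=0.9375):
    """Return (number of blocks, K/V blocks kept per Q block) for a latent grid."""
    n_blocks = math.prod(math.ceil(g / c) for g, c in zip(grid, chunk))
    keep = max(1, round(n_blocks * (1.0 - sparsity)))
    return n_blocks, keep


def select_kv_blocks(q_block_means, k_block_means, keep):
    """For each Q block, pick the `keep` K blocks with the highest pooled dot-product score."""
    selected = []
    for q in q_block_means:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block_means]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:keep]
        selected.append(sorted(top))
    return selected


# Hypothetical patchified grid (T_lat, H_lat, W_lat) = (8, 16, 16).
n_blocks, keep = block_counts((8, 16, 16))
print(n_blocks, "blocks,", keep, "K/V blocks kept per Q block")

# Tiny routing example: 2 Q blocks, 4 K blocks, each summarized by a mean-pooled vector.
q_means = [[1.0, 0.0], [0.0, 1.0]]
k_means = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
print(select_kv_blocks(q_means, k_means, keep=2))
```

Full attention is then computed only between each Q block and its selected K/V blocks, which is where the saving comes from.
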
## Programmatic LoRA merge

Each LoRA can be loaded separately for fine-grained control:

```python
from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
from longcat_video.lora import compute_merged_delta, group_lora_tensors
from safetensors import safe_open
import mlx.core as mx

pipeline = LongCatVideoT2VPipeline(...)  # standard 3-component load

# Merge cfg_step_lora for the fast path (8 steps, no CFG correction).
# Read every tensor in the LoRA file into an MLX array, keyed by tensor name.
# (mx.load(path) can also read a .safetensors file directly into a dict of mx.array.)
lora_sd = {}
with safe_open("weights/lora/cfg_step_lora.safetensors", framework="numpy") as f:
    for k in f.keys():
        lora_sd[k] = mx.array(f.get_tensor(k))

# (The LoRA merge helper covers both cfg_step_lora and refinement_lora;
# load whichever path your variant uses.)
```

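
For readers unfamiliar with what "merging" a LoRA means, the underlying math is the standard low-rank update `W' = W + (alpha / rank) * B @ A`. The dependency-free sketch below shows that update and its inverse on tiny matrices. It is not the repo's `compute_merged_delta`; the function names, the `alpha / rank` scaling, and the matrix layout are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_a, lora_b, alpha):
    """delta = (alpha / rank) * B @ A, with A of shape (rank, in) and B of shape (out, rank)."""
    rank = len(lora_a)
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def merge(weight, delta, sign=1.0):
    """Add the delta to a weight matrix (sign=-1.0 removes it again)."""
    return [[w + sign * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 0.0], [0.0, 1.0]]   # toy 2x2 base weight
lora_a = [[1.0, 2.0]]               # rank 1, in = 2
lora_b = [[1.0], [3.0]]             # out = 2, rank 1

delta = lora_delta(lora_a, lora_b, alpha=2.0)
merged = merge(weight, delta)
restored = merge(merged, delta, sign=-1.0)
print(merged)
print(restored)
```

Subtracting the same delta restores the base weight, which is one way a single DiT checkpoint could switch between `cfg_step_lora` and `refinement_lora` at runtime.
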
## License

MIT, matching the upstream [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video)
license. Use of the model implies compliance with the upstream project's responsible-use
guidelines (no generation of harmful, defamatory, or non-consensual content).

## Acknowledgements

- [Meituan LongCat team](https://github.com/meituan-longcat): original PyTorch
  model and tech report
- [ml-explore/mlx](https://github.com/ml-explore/mlx): the framework
- [mlx-community](https://huggingface.co/mlx-community): collection home