---
license: mit
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- apple-silicon
- video-generation
- text-to-video
- image-to-video
- video-continuation
- longcat
- flow-matching
- block-sparse-attention
base_model:
- meituan-longcat/LongCat-Video
language:
- en
- zh
---
Part of the [LongCat-Video MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx) collection.
# LongCat-Video-bf16 (MLX)
Apple MLX bf16 weights for [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video),
Meituan's 13.6 B-parameter base text/image-to-video diffusion model, with the
**`cfg_step_lora` and `refinement_lora` published as separate files** for
runtime task switching.
The same DiT checkpoint serves all six task variants:
| Variant | Pipeline | LoRAs used |
|---|---|---|
| **T2V** (text-to-video) | `pipeline_t2v` | none (baseline) or `cfg_step_lora` (fast) |
| **I2V** (image-to-video) | `pipeline_i2v` | same |
| **Video Continuation** | `pipeline_continuation` | same |
| **720p / 30fps refinement** | `refinement.py` | `refinement_lora` + Block Sparse Attention |
| **Long-Video** | (chained Continuation) | same as Continuation |
| **Interactive Video** | (per-segment T2V/Continuation) | same |
For the companion audio-driven Avatar 1.5 port (built from the same DiT
architecture + audio cross-attention overlay), see
[mlx-community/LongCat-Video-Avatar-1.5-bf16](https://huggingface.co/mlx-community/LongCat-Video-Avatar-1.5-bf16).
## TL;DR
| | |
|---|---|
| **Architecture** | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
| **Params** | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
| **Format** | bf16, sharded safetensors (HF-style per-component subdirs) |
| **Disk** | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
| **Hardware** | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| **Inference** | 50-step baseline OR ~8-step with `cfg_step_lora` (fast); refinement adds 720p/30fps SDEdit pass |
| **License** | MIT (matches upstream Meituan) |
## Quick start
```bash
# 1. Pull weights (~42 GB)
hf download mlx-community/LongCat-Video-bf16 \
--local-dir ./weights
# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"
# 3. Run text-to-video at 480p / 15fps
.venv/bin/python scripts/run_t2v.py \
--weights ./weights/.. \
--prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
--num-frames 93 \
--out output_t2v.mp4
# 4. (Optional) Refinement pass to 720p / 30fps
.venv/bin/python scripts/run_refine.py \
--weights ./weights/.. \
--stage1 output_t2v.npy \
--prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
--out output_refined.mp4
```
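Because step 1 pulls roughly 42 GB, a quick pre-flight check of free disk space can save a failed download. The following is a minimal sketch using only the Python standard library; the helper name `has_room` and the 5 GB headroom are illustrative assumptions, not part of the inference repo.

```python
import shutil

REQUIRED_GB = 42  # approximate total size quoted in the TL;DR table


def has_room(path=".", required_gb=REQUIRED_GB, headroom_gb=5):
    """Return True if the filesystem holding `path` can fit the weights.

    `headroom_gb` is an arbitrary safety margin (assumption), not a
    documented requirement of the model or the download tool.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb + headroom_gb


# Example: check the current directory before running `hf download`.
print("enough space" if has_room(".") else "not enough free disk space")
```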
## Six task variants from one DiT
All six pipelines share the same 13.6 B DiT weights. The **conditioning input**
and **LoRA stack** are what change:
| Variant | Conditioning latent | LoRA stack | BSA |
|---|---|---|---|
| T2V | pure noise | (optional `cfg_step_lora`) | off |
| I2V | 1 reference frame at head | (optional `cfg_step_lora`) | off |
| Continuation | last N frames of prior clip | (optional `cfg_step_lora`) | off |
| Refinement | partial-noise on VAE-encoded upsample of coarse output | `refinement_lora` | **on** |
| Long-Video | chained Continuation segments | inherits | off |
| Interactive | sequenced T2V/Continuation w/ per-segment prompts | inherits | off |
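The table above can be read as a small lookup from variant to runtime settings. The sketch below restates it in Python; `VariantSettings` and `settings_for` are hypothetical names chosen for illustration and are not the API of the inference repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantSettings:
    conditioning: str     # what the conditioning latent contains
    lora: Optional[str]   # LoRA to merge, or None for the baseline
    bsa: bool             # whether Block Sparse Attention is enabled


# Conditioning descriptions copied from the table above.
_CONDITIONING = {
    "t2v": "pure noise",
    "i2v": "1 reference frame at head",
    "continuation": "last N frames of prior clip",
    "refinement": "partial-noise on VAE-encoded upsample of coarse output",
    "long_video": "chained Continuation segments",
    "interactive": "sequenced T2V/Continuation with per-segment prompts",
}


def settings_for(variant: str, fast: bool = False) -> VariantSettings:
    """Map a variant name to its LoRA/BSA settings (illustrative only)."""
    if variant not in _CONDITIONING:
        raise ValueError(f"unknown variant: {variant!r}")
    if variant == "refinement":
        # Refinement is the only variant that uses refinement_lora and BSA.
        return VariantSettings(_CONDITIONING[variant], "refinement_lora", True)
    # Every other variant optionally merges cfg_step_lora for the fast path.
    lora = "cfg_step_lora" if fast else None
    return VariantSettings(_CONDITIONING[variant], lora, False)
```

For example, `settings_for("i2v", fast=True)` selects `cfg_step_lora` with BSA off, while `settings_for("refinement")` selects `refinement_lora` with BSA on.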
## Architecture
This is the **base text-to-video** port. Differences from the Avatar overlay
that the companion repo adds:
- **No audio path**: no Whisper-Large-v3 encoder, no AudioProjModel, no
audio cross-attention in DiT blocks
- **No Reference Skip Attention**: base I2V uses the reference frame as a
*motion anchor*, not a persistent identity, so the Avatar-specific Q-slicing
is not used here
- **Standard text-CFG** (2-pass), vs. Avatar's 3-pass disentangled CFG
- **`scheduler_shift = 12.0`**, vs. Avatar's 7.0
- **Block Sparse Attention**: needed only by the 720p refinement pass
(`enable_bsa: false` in the base DiT config; the refinement script flips
it on along with hot-swapping `refinement_lora`)
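To give a feel for what a `scheduler_shift` of 12.0 versus 7.0 does, here is a minimal sketch of the timestep-shift warp commonly used by flow-matching schedulers (for example diffusers' `FlowMatchEulerDiscreteScheduler`). Whether this port applies exactly this formula is an assumption; the function names are illustrative.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Warp a noise level in [0, 1]; a larger shift keeps sigma high for longer."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps: int, shift: float) -> list:
    """Linear sigmas from 1.0 down towards 0 (final 0 excluded), then warped."""
    return [shift_sigma(1.0 - i / num_steps, shift) for i in range(num_steps)]


# Compare how the two shift values treat the midpoint sigma = 0.5.
print(shift_sigma(0.5, 12.0))  # base model setting
print(shift_sigma(0.5, 7.0))   # Avatar setting
```

Under this formula a higher shift spends more of the sampling steps at high noise levels, which is one common motivation for raising it on large video latents.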
### Block Sparse Attention details
BSA params from the published config:
```json
"bsa_params": {
"sparsity": 0.9375,
"chunk_3d_shape_q": [4, 4, 4],
"chunk_3d_shape_k": [4, 4, 4]
}
```
Tokens are grouped into 4×4×4 = 64-token blocks along the patchified
(T_lat, H_lat, W_lat) grid. Sparsity 0.9375 keeps 6.25% of K/V blocks per
Q block via top-k routing on block-level mean-pooled scores. This makes
720p attention tractable; without it the 720p second pass would be too
expensive on Apple Silicon. (The Tier A pure-MLX implementation in this port
is numerically correct but not yet kernel-fast; a Tier B Metal kernel is in
progress.)
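The block arithmetic above can be sketched as follows. This is an illustration in plain Python, not the port's implementation: the latent grid `(24, 32, 32)` is a made-up example, and rounding up partially filled edge chunks is an assumption.

```python
import math


def bsa_block_budget(grid, chunk=(4, 4, 4), sparsity=0.9375):
    """Return (tokens_per_block, num_blocks, kept_kv_blocks_per_q_block)."""
    tokens_per_block = math.prod(chunk)
    # Assumption: a partially filled chunk at the grid edge counts as a block.
    num_blocks = math.prod(math.ceil(g / c) for g, c in zip(grid, chunk))
    kept = max(1, round(num_blocks * (1.0 - sparsity)))
    return tokens_per_block, num_blocks, kept


def topk_blocks(block_scores, k):
    """Indices of the k highest-scoring K/V blocks for a single Q block."""
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    return sorted(order[:k])


# Hypothetical latent grid (T_lat, H_lat, W_lat), for illustration only.
tokens, blocks, kept = bsa_block_budget((24, 32, 32))
print(tokens, blocks, kept)

# Block-level scores would come from mean-pooled Q and K; toy values here.
print(topk_blocks([0.1, 0.9, 0.3, 0.7], k=2))
```

With this example grid each Q block attends to 24 of 384 K/V blocks, which is the 6.25% implied by a sparsity of 0.9375.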
## Programmatic LoRA merge
Each LoRA can be loaded separately for fine-grained control:
```python
from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
from longcat_video.lora import compute_merged_delta, group_lora_tensors
import mlx.core as mx

pipeline = LongCatVideoT2VPipeline(...)  # standard 3-component load

# Load cfg_step_lora for the fast path (8 steps, no CFG correction).
# mx.load reads .safetensors directly into a dict of mx.array, including bf16
# tensors, which safetensors' NumPy backend may not be able to represent.
lora_sd = mx.load("weights/lora/cfg_step_lora.safetensors")

# The LoRA merge helpers imported above cover both cfg_step_lora and
# refinement_lora; load whichever file your variant uses.
```
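For readers unfamiliar with what a LoRA merge does numerically, the sketch below shows the standard low-rank update `W' = W + scale * (B @ A)` on plain Python lists. It is a generic illustration: the key layout, the `alpha / rank` scaling convention, and the function names are assumptions and do not describe `compute_merged_delta` or `group_lora_tensors`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, alpha=None):
    """Return W + scale * (B @ A).

    weight: (out, in), lora_a: (rank, in), lora_b: (out, rank).
    scale is alpha / rank when alpha is given, else 1.0 (assumed convention).
    """
    rank = len(lora_a)
    scale = alpha / rank if alpha is not None else 1.0
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 weight with a rank-1 LoRA.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # (rank=1, in=2)
B = [[1.0], [0.0]]    # (out=2, rank=1)
print(merge_lora(W, A, B))
```

In practice the same update would be applied per layer on `mx.array` tensors; plain lists are used here only to keep the example dependency-free.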
## License
MIT, matching the upstream [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video)
license. Use of the model implies compliance with the upstream's responsible-use
guidelines (no generation of harmful, defamatory, or non-consensual content).
## Acknowledgements
- [Meituan LongCat team](https://github.com/meituan-longcat): original PyTorch
model and tech report
- [ml-explore/mlx](https://github.com/ml-explore/mlx): the framework
- [mlx-community](https://huggingface.co/mlx-community): collection home