---
license: mit
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- apple-silicon
- video-generation
- text-to-video
- image-to-video
- video-continuation
- longcat
- flow-matching
- block-sparse-attention
base_model:
- meituan-longcat/LongCat-Video
language:
- en
- zh
---
Part of the [LongCat-Video - MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx) collection.
# LongCat-Video-bf16 (MLX)
Apple MLX bf16 weights for [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video),
Meituan's 13.6 B-parameter base text/image-to-video diffusion model, with the
**`cfg_step_lora` and `refinement_lora` published as separate files** for
runtime task switching.
The same DiT checkpoint serves all six task variants:
| Variant | Pipeline | LoRAs used |
|---|---|---|
| **T2V** (text-to-video) | `pipeline_t2v` | none (baseline) or `cfg_step_lora` (fast) |
| **I2V** (image-to-video) | `pipeline_i2v` | same |
| **Video Continuation** | `pipeline_continuation` | same |
| **720p / 30fps refinement** | `refinement.py` | `refinement_lora` + Block Sparse Attention |
| **Long-Video** | (chained Continuation) | same as Continuation |
| **Interactive Video** | (per-segment T2V/Continuation) | same |
For the companion audio-driven Avatar 1.5 port (built on the same DiT
architecture plus an audio cross-attention overlay), see
[mlx-community/LongCat-Video-Avatar-1.5-bf16](https://huggingface.co/mlx-community/LongCat-Video-Avatar-1.5-bf16).
## TL;DR
| | |
|---|---|
| **Architecture** | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
| **Params** | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
| **Format** | bf16, sharded safetensors (HF-style per-component subdirs) |
| **Disk** | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
| **Hardware** | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| **Inference** | 50-step baseline OR ~8-step with `cfg_step_lora` (fast); refinement adds 720p/30fps SDEdit pass |
| **License** | MIT (matches upstream Meituan) |
## Quick start
```bash
# 1. Pull weights (~42 GB)
hf download mlx-community/LongCat-Video-bf16 \
  --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video at 480p / 15fps
.venv/bin/python scripts/run_t2v.py \
  --weights ./weights/.. \
  --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
  --num-frames 93 \
  --out output_t2v.mp4

# 4. (Optional) Refinement pass to 720p / 30fps
.venv/bin/python scripts/run_refine.py \
  --weights ./weights/.. \
  --stage1 output_t2v.npy \
  --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
  --out output_refined.mp4
```
## Six task variants from one DiT
All six pipelines share the same 13.6 B DiT weights. The **conditioning input**
and **LoRA stack** are what change:
| Variant | Conditioning latent | LoRA stack | BSA |
|---|---|---|---|
| T2V | pure noise | (optional `cfg_step_lora`) | off |
| I2V | 1 reference frame at head | (optional `cfg_step_lora`) | off |
| Continuation | last N frames of prior clip | (optional `cfg_step_lora`) | off |
| Refinement | partial-noise on VAE-encoded upsample of coarse output | `refinement_lora` | **on** |
| Long-Video | chained Continuation segments | inherits | off |
| Interactive | sequenced T2V/Continuation w/ per-segment prompts | inherits | off |
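The table above can be read as a small lookup structure. The sketch below is illustrative only: `VariantSpec`, `VARIANTS`, and `lora_stack_for` are hypothetical names that do not come from the `longcat_video` package, and the values are copied from the table. Long-Video and Interactive are omitted because they chain the entries shown here and inherit their settings.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VariantSpec:
    """Hypothetical record mirroring one row of the variant table."""
    conditioning: str
    required_lora: Optional[str]  # LoRA the variant always uses, if any
    optional_lora: Optional[str]  # LoRA that enables the fast path, if any
    bsa: bool                     # whether Block Sparse Attention is enabled


VARIANTS = {
    "t2v": VariantSpec("pure noise", None, "cfg_step_lora", False),
    "i2v": VariantSpec("1 reference frame at head", None, "cfg_step_lora", False),
    "continuation": VariantSpec("last N frames of prior clip", None, "cfg_step_lora", False),
    "refinement": VariantSpec(
        "partial noise on VAE-encoded upsample of coarse output",
        "refinement_lora", None, True,
    ),
}


def lora_stack_for(variant: str, fast: bool = False) -> Tuple[str, ...]:
    """Return the LoRA names a variant would use, in application order."""
    spec = VARIANTS[variant]
    stack = []
    if spec.required_lora:
        stack.append(spec.required_lora)
    if fast and spec.optional_lora:
        stack.append(spec.optional_lora)
    return tuple(stack)


print(lora_stack_for("t2v", fast=True))   # ('cfg_step_lora',)
print(lora_stack_for("refinement"))       # ('refinement_lora',)
```

Keeping the per-variant differences in data rather than in code paths reflects the point of this section: the DiT weights stay fixed, and only the conditioning, LoRA stack, and BSA flag vary.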
## Architecture
This is the **base text-to-video** port. It differs from the Avatar overlay in
the companion repo as follows:
- **No audio path**: no Whisper-Large-v3 encoder, no AudioProjModel, and no
audio cross-attention in the DiT blocks
- **No Reference Skip Attention**: base I2V uses the reference frame as a
*motion anchor*, not a persistent identity, so the Avatar-specific Q-slicing
is not used here
- **Standard text-CFG** (2-pass), versus Avatar's 3-pass disentangled CFG
- **`scheduler_shift = 12.0`**, versus Avatar's 7.0
- **Block Sparse Attention**: needed only by the 720p refinement pass
(`enable_bsa: false` in the base DiT config; the refinement script flips
it on and hot-swaps in `refinement_lora`)
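To make the CFG and scheduler-shift bullets concrete, here is a minimal sketch in plain Python. It rests on two assumptions that this README does not confirm: that `scheduler_shift` follows the time-shift formula commonly used with flow-matching schedulers, `sigma' = s * sigma / (1 + (s - 1) * sigma)`, and that 2-pass CFG is the usual linear combination of a conditional and an unconditional prediction. The function names and the guidance scale of 4.0 are hypothetical.

```python
from typing import List


def shift_sigma(sigma: float, shift: float) -> float:
    """Assumed flow-matching time shift: pushes sigma toward the noisy end."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def cfg_combine(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard 2-pass classifier-free guidance on two model predictions."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Under the assumed formula, a larger shift maps the same raw sigma
# to a noisier level.
for shift in (7.0, 12.0):
    print(shift, round(shift_sigma(0.5, shift), 4))

# Toy predictions; real inputs would be the DiT outputs for the prompt
# and for the empty/negative prompt.
print(cfg_combine([1.0, 2.0], [0.0, 1.0], scale=4.0))  # [4.0, 5.0]
```

Under this assumed formula, a shift of 12.0 spends more of the schedule at high noise levels than a shift of 7.0, which is one plausible reading of why the base model and Avatar use different values.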
### Block Sparse Attention details
BSA params from the published config:
```json
"bsa_params": {
"sparsity": 0.9375,
"chunk_3d_shape_q": [4, 4, 4],
"chunk_3d_shape_k": [4, 4, 4]
}
```
Tokens are grouped into 4×4×4 = 64-token blocks along the patchified
(T_lat, H_lat, W_lat) grid. A sparsity of 0.9375 keeps 6.25% of K/V blocks per
Q block (1 - 0.9375 = 0.0625), selected by top-k routing on block-level
mean-pooled scores. This makes 720p attention tractable; without it, the 720p
second pass would be too expensive on Apple Silicon. (The Tier A pure-MLX
implementation in this port is functionally correct but not yet kernel-fast;
a Tier B Metal kernel is in progress.)
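The block arithmetic and the routing step can be sketched in plain Python. This is not the port's implementation. It assumes that the block score is the dot product of the mean-pooled Q and K block vectors, and that grids which are not multiples of 4 are padded up (ceiling division). The function names and the example grid are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def bsa_block_budget(
    grid: Tuple[int, int, int],
    chunk: Tuple[int, int, int] = (4, 4, 4),
    sparsity: float = 0.9375,
) -> Tuple[int, int]:
    """Return (total K/V blocks, K/V blocks kept per Q block)."""
    n_blocks = 1
    for size, step in zip(grid, chunk):
        n_blocks *= math.ceil(size / step)  # assumption: pad partial blocks
    keep = max(1, round(n_blocks * (1.0 - sparsity)))
    return n_blocks, keep


def select_kv_blocks(
    q_block_mean: Sequence[float],
    k_block_means: Sequence[Sequence[float]],
    keep: int,
) -> List[int]:
    """Top-k routing: indices of the K/V blocks with the highest pooled score."""
    scores = [sum(q * k for q, k in zip(q_block_mean, km)) for km in k_block_means]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical latent grid of 8 x 16 x 16 tokens -> 2 * 4 * 4 = 32 blocks.
total, keep = bsa_block_budget((8, 16, 16))
print(total, keep)  # 32 2

# Toy pooled vectors: the Q block attends only to the 2 best-scoring K blocks.
k_means = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [-1.0, 0.0]]
print(select_kv_blocks([1.0, 0.0], k_means, keep=2))  # [1, 2]
```

Full attention is then computed only between each Q block and its selected K/V blocks, so the cost scales with the kept fraction rather than with the full block count.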
## Programmatic LoRA merge
Each LoRA can be loaded separately for fine-grained control:
```python
from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
from longcat_video.lora import compute_merged_delta, group_lora_tensors
from safetensors import safe_open
import mlx.core as mx

pipeline = LongCatVideoT2VPipeline(...)  # standard 3-component load

# Load the cfg_step_lora tensors for the fast path (8 steps, no CFG correction)
lora_sd = {}
with safe_open("weights/lora/cfg_step_lora.safetensors", framework="numpy") as f:
    for k in f.keys():
        lora_sd[k] = mx.array(f.get_tensor(k))

# The LoRA merge helpers imported above cover both cfg_step_lora and
# refinement_lora; load whichever file your variant uses.
```
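For readers unfamiliar with LoRA merging, the sketch below shows the standard update `W' = W + (alpha / rank) * B @ A` on nested Python lists. It illustrates the general technique, not the behavior of `compute_merged_delta`: the scaling convention, tensor key layout, and function names used here are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def lora_delta(a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Compute (alpha / rank) * B @ A for A of shape (rank, in), B of shape (out, rank)."""
    rank = len(a)
    in_dim = len(a[0])
    scale = alpha / rank
    return [
        [scale * sum(b_row[r] * a[r][j] for r in range(rank)) for j in range(in_dim)]
        for b_row in b
    ]


def merge_weight(weight: Matrix, delta: Matrix) -> Matrix:
    """Fold the low-rank delta into the base weight, elementwise."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


# Toy rank-1 adapter on a 2 x 2 weight.
base = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]      # (rank=1, in=2)
b = [[1.0], [0.5]]    # (out=2, rank=1)
merged = merge_weight(base, lora_delta(a, b, alpha=1.0))
print(merged)  # [[2.0, 2.0], [0.5, 2.0]]
```

Because the merged weight has the same shape as the base weight, switching tasks at runtime amounts to adding or subtracting a delta, which is presumably why the two LoRAs are shipped as separate files.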
## License
MIT, matching the upstream [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video)
license. Use of the model implies compliance with the upstream responsible-use
guidelines (no generation of harmful, defamatory, or non-consensual content).
## Acknowledgements
- [Meituan LongCat team](https://github.com/meituan-longcat): original PyTorch
model and tech report
- [ml-explore/mlx](https://github.com/ml-explore/mlx): the framework
- [mlx-community](https://huggingface.co/mlx-community): collection home