---
license: mit
library_name: mlx
pipeline_tag: text-to-video
tags:
  - mlx
  - apple-silicon
  - video-generation
  - text-to-video
  - image-to-video
  - video-continuation
  - longcat
  - flow-matching
  - block-sparse-attention
base_model:
  - meituan-longcat/LongCat-Video
language:
  - en
  - zh
---

Part of the [LongCat-Video - MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx) collection.


# LongCat-Video-bf16 (MLX)

Apple MLX bf16 weights for [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video),
Meituan's 13.6 B-parameter base text/image-to-video diffusion model, with the
**`cfg_step_lora` and `refinement_lora` published as separate files** for
runtime task switching.

The same DiT checkpoint serves all six task variants:

| Variant | Pipeline | LoRAs used |
|---|---|---|
| **T2V** (text-to-video) | `pipeline_t2v` | none (baseline) or `cfg_step_lora` (fast) |
| **I2V** (image-to-video) | `pipeline_i2v` | same as T2V |
| **Video Continuation** | `pipeline_continuation` | same as T2V |
| **720p / 30fps refinement** | `refinement.py` | `refinement_lora` + Block Sparse Attention |
| **Long-Video** | (chained Continuation) | same as Continuation |
| **Interactive Video** | (per-segment T2V/Continuation) | same as T2V |

For the companion audio-driven Avatar 1.5 port (built from the same DiT
architecture + audio cross-attention overlay), see
[mlx-community/LongCat-Video-Avatar-1.5-bf16](https://huggingface.co/mlx-community/LongCat-Video-Avatar-1.5-bf16).

## TL;DR

| | |
|---|---|
| **Architecture** | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
| **Params** | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
| **Format** | bf16, sharded safetensors (HF-style per-component subdirs) |
| **Disk** | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
| **Hardware** | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| **Inference** | 50-step baseline OR ~8-step with `cfg_step_lora` (fast); refinement adds 720p/30fps SDEdit pass |
| **License** | MIT (matches upstream Meituan) |

## Quick start

```bash
# 1. Pull weights (~42 GB)
hf download mlx-community/LongCat-Video-bf16 \
    --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video at 480p / 15fps
.venv/bin/python scripts/run_t2v.py \
    --weights ./weights/.. \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --num-frames 93 \
    --out output_t2v.mp4

# 4. (Optional) Refinement pass to 720p / 30fps
.venv/bin/python scripts/run_refine.py \
    --weights ./weights/.. \
    --stage1 output_t2v.npy \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --out output_refined.mp4
```
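
The `--num-frames 93` value in step 3 has the form `4k + 1`. The sketch below works through the frame arithmetic under the assumption that the Wan 2.1 VAE compresses time by 4× and encodes the first frame on its own. The helper names are hypothetical and are not part of `longcat-video-mlx`.

```python
# Hypothetical helpers for illustration; not part of longcat-video-mlx.
TEMPORAL_STRIDE = 4  # assumed temporal compression factor of the Wan 2.1 VAE


def latent_frames(num_frames: int, stride: int = TEMPORAL_STRIDE) -> int:
    """Latent frames for a clip whose first frame is encoded on its own."""
    if (num_frames - 1) % stride != 0:
        raise ValueError(f"num_frames must be {stride}*k + 1, got {num_frames}")
    return (num_frames - 1) // stride + 1


def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback duration of a clip in seconds."""
    return num_frames / fps


print(latent_frames(93))       # latent frames for the quick-start clip
print(clip_seconds(93, 15.0))  # duration at the stage-1 frame rate
```

If this assumption holds, a frame count that is not of the form `4k + 1` would need padding or trimming before encoding, which is why the helper rejects it rather than rounding silently.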

## Six task variants from one DiT

All six pipelines share the same 13.6 B DiT weights. The **conditioning input**
and **LoRA stack** are what change:

| Variant | Conditioning latent | LoRA stack | BSA |
|---|---|---|---|
| T2V | pure noise | (optional `cfg_step_lora`) | off |
| I2V | 1 reference frame at head | (optional `cfg_step_lora`) | off |
| Continuation | last N frames of prior clip | (optional `cfg_step_lora`) | off |
| Refinement | partial-noise on VAE-encoded upsample of coarse output | `refinement_lora` | **on** |
| Long-Video | chained Continuation segments | inherits | off |
| Interactive | sequenced T2V/Continuation w/ per-segment prompts | inherits | off |
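
The table above can be read as a small lookup structure. The following is a minimal sketch of that idea; `VariantSpec`, `VARIANTS`, and `loras_to_load` are hypothetical names chosen for illustration and do not come from the repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VariantSpec:
    conditioning: str    # what the conditioning latent contains
    lora: str            # LoRA associated with this variant
    lora_required: bool  # False means the LoRA is an optional fast path
    bsa: bool            # whether Block Sparse Attention is switched on


# Transcribed from the table; "inherits" rows reuse the Continuation/T2V settings.
VARIANTS = {
    "t2v": VariantSpec("pure noise", "cfg_step_lora", False, False),
    "i2v": VariantSpec("1 reference frame at head", "cfg_step_lora", False, False),
    "continuation": VariantSpec("last N frames of prior clip", "cfg_step_lora", False, False),
    "refinement": VariantSpec(
        "partially noised VAE-encoded upsample of coarse output",
        "refinement_lora", True, True,
    ),
    "long_video": VariantSpec("chained Continuation segments", "cfg_step_lora", False, False),
    "interactive": VariantSpec("sequenced T2V/Continuation segments", "cfg_step_lora", False, False),
}


def loras_to_load(variant: str, fast: bool = False) -> List[str]:
    """LoRA names to merge for a variant; `fast` opts into the optional LoRA."""
    spec = VARIANTS[variant]
    return [spec.lora] if (spec.lora_required or fast) else []


print(loras_to_load("t2v"), loras_to_load("t2v", fast=True), loras_to_load("refinement"))
```

Keeping the variant differences in data rather than in separate code paths mirrors the point of this section: the DiT weights stay fixed and only the conditioning and LoRA stack change.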

## Architecture

This is the **base text-to-video** port. It differs from the Avatar overlay
in the companion repo as follows:

- **No audio path**: no Whisper-Large-v3 encoder, no AudioProjModel, no
  audio cross-attention in DiT blocks
- **No Reference Skip Attention**: base I2V uses the reference frame as a
  *motion anchor*, not a persistent identity, so the Avatar-specific Q-slicing
  is not used here
- **Standard text-CFG** (2-pass), vs. Avatar's 3-pass disentangled CFG
- **`scheduler_shift = 12.0`**, vs. Avatar's 7.0
- **Block Sparse Attention**: needed only by the 720p refinement pass
  (`enable_bsa: false` in the base DiT config; the refinement script flips
  it on along with hot-swapping `refinement_lora`)
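
To make the `scheduler_shift` and 2-pass CFG items concrete, here is a minimal sketch. It assumes the time-shift formula commonly used by flow-matching schedulers and the standard classifier-free guidance combination; the port's scheduler and pipeline may implement either differently, and both function names are hypothetical.

```python
from typing import List


def shift_sigma(sigma: float, shift: float) -> float:
    """Warp a noise level in [0, 1]; shift > 1 pushes values toward 1 (assumed formula)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def cfg_combine(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Two-pass CFG: extrapolate from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Compare the base shift (12.0) with the Avatar shift (7.0) at a few noise levels.
for s in (0.25, 0.5, 0.75):
    print(s, round(shift_sigma(s, 12.0), 4), round(shift_sigma(s, 7.0), 4))
```

Under this formula a larger shift spends more of the step budget at high noise levels, and the endpoints 0 and 1 are left unchanged.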

### Block Sparse Attention details

BSA params from the published config:

```json
"bsa_params": {
  "sparsity": 0.9375,
  "chunk_3d_shape_q": [4, 4, 4],
  "chunk_3d_shape_k": [4, 4, 4]
}
```

Tokens are grouped into 4×4×4 = 64-token blocks along the patchified
(T_lat, H_lat, W_lat) grid. A sparsity of 0.9375 keeps 6.25% of K/V blocks per
Q block via top-k routing on block-level mean-pooled scores. This makes
720p attention tractable; without it the 720p second pass would be too
expensive on Apple Silicon. (The Tier A pure-MLX implementation in this port is
numerically correct but not yet kernel-fast; a Tier B Metal kernel is in progress.)
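
The block counting and top-k routing described above can be sketched in plain Python. This is an illustration only, not the port's Tier A implementation: all function names are hypothetical, padding partial blocks with `ceil` is an assumption, and scoring by the dot product of mean-pooled queries and keys is one plausible reading of "block-level mean-pooled scores".

```python
import math


def count_blocks(grid, chunk=(4, 4, 4)):
    """Number of 3D blocks covering a (T_lat, H_lat, W_lat) token grid (ceil is assumed)."""
    return math.prod(math.ceil(g / c) for g, c in zip(grid, chunk))


def kept_kv_blocks(num_blocks, sparsity=0.9375):
    """K/V blocks each query block attends to, with a floor of one."""
    return max(1, round(num_blocks * (1.0 - sparsity)))


def mean_pool(block):
    """Average the token vectors of one block into a single vector."""
    n = len(block)
    return [sum(tok[d] for tok in block) / n for d in range(len(block[0]))]


def route(q_block, k_blocks, k):
    """Indices of the k K/V blocks whose pooled key best matches the pooled query."""
    q = mean_pool(q_block)
    scores = [sum(a * b for a, b in zip(q, mean_pool(kb))) for kb in k_blocks]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return sorted(top)


blocks = count_blocks((16, 16, 16))  # hypothetical token grid
print(blocks, kept_kv_blocks(blocks))
```

Full attention is then computed only between each query block and its selected K/V blocks, so the cost scales with the kept fraction rather than with the full block count.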

## Programmatic LoRA merge

Each LoRA can be loaded separately for fine-grained control:

```python
from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
from longcat_video.lora import compute_merged_delta, group_lora_tensors
from safetensors import safe_open
import mlx.core as mx

pipeline = LongCatVideoT2VPipeline(...)   # standard 3-component load

# Merge cfg_step_lora for the fast path (8 steps, no CFG correction)
lora_sd = {}
with safe_open("weights/lora/cfg_step_lora.safetensors", framework="numpy") as f:
    for k in f.keys():
        lora_sd[k] = mx.array(f.get_tensor(k))

# (The LoRA merge helper covers both cfg_step_lora and refinement_lora;
# load whichever path your variant uses.)
```
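
For readers unfamiliar with what a LoRA merge computes, the sketch below shows the standard low-rank update `W' = W + (alpha / rank) * B @ A` on nested lists. It is a generic illustration under that assumption: the repo's `compute_merged_delta` may use a different signature, scaling, or tensor layout, and every name here is hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_a, lora_b, alpha, rank):
    """Delta = (alpha / rank) * B @ A, with A of shape (rank, in) and B of shape (out, rank)."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def merge(weight, delta):
    """Add the low-rank delta to the base weight element-wise."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


# Tiny rank-1 example: 2 inputs, 2 outputs.
A = [[1.0, 2.0]]    # (rank=1, in=2)
B = [[3.0], [4.0]]  # (out=2, rank=1)
W = [[1.0, 0.0], [0.0, 1.0]]
print(merge(W, lora_delta(A, B, alpha=2.0, rank=1)))
```

Merging the delta into the base weight once, rather than applying the two low-rank factors at every forward pass, keeps inference cost identical to the unmodified model. Hot-swapping a LoRA then amounts to subtracting one delta and adding another.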

## License

MIT, matching the upstream [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video)
license. Use of the model implies compliance with the upstream project's responsible-use
guidelines (no generation of harmful, defamatory, or non-consensual content).

## Acknowledgements

- [Meituan LongCat team](https://github.com/meituan-longcat): original PT
  model + tech report
- [ml-explore/mlx](https://github.com/ml-explore/mlx): the framework
- [mlx-community](https://huggingface.co/mlx-community): collection home