---
license: mit
library_name: mlx
pipeline_tag: text-to-video
tags:
- mlx
- apple-silicon
- video-generation
- text-to-video
- image-to-video
- video-continuation
- longcat
- flow-matching
- block-sparse-attention
- quantized
- 4-bit
base_model:
- mlx-community/LongCat-Video-bf16
language:
- en
- zh
---
Part of the [LongCat-Video - MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx-6a216a3576c098e83c1cc167) collection.
# LongCat-Video-q4 (MLX)
4-bit quantized variant of [mlx-community/LongCat-Video-bf16](https://huggingface.co/mlx-community/LongCat-Video-bf16).
Same model, same six task variants (T2V / I2V / Continuation / Refinement / Long-Video / Interactive),
same `cfg_step_lora` + `refinement_lora` files, just with the DiT `nn.Linear` layers
quantized to 4-bit via `mlx.nn.quantize` for smaller-RAM Macs.
## TL;DR
| | |
|---|---|
| **DiT** | 4-bit quantized (`group_size=64`, skip `final_layer.linear` + embedders + AdaLN) |
| **DiT size** | ~9 GB (2 shards; 2.85× smaller than bf16's 26 GB) |
| **VAE / umT5 / LoRAs** | bf16 (unchanged from the bf16 variant) |
| **Total disk** | ~25 GB (vs 42 GB bf16) |
| **Min unified memory** | ~32 GB recommended for 480p |
| **Inference** | 50-step baseline OR 8-step with `cfg_step_lora` (fast) |
| **License** | MIT |
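
The DiT shrinks by 2.85× rather than the 4× that going from 16 to 4 bits might suggest. A rough back-of-the-envelope sketch of why, assuming each group of 64 weights also stores one 16-bit scale and one 16-bit bias (the usual layout for group-wise affine quantization; the exact on-disk layout is not confirmed by this card):

```python
def bits_per_weight(bits: int, group_size: int,
                    scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective storage per weight: payload bits plus per-group scale/bias overhead.

    scale_bits and bias_bits are assumptions, not values taken from this repo.
    """
    return bits + (scale_bits + bias_bits) / group_size


BF16_DIT_GB = 26  # bf16 DiT size quoted in the table above

bpw = bits_per_weight(bits=4, group_size=64)
# Lower bound on DiT size if every single weight were quantized.
floor_gb = BF16_DIT_GB * bpw / 16

print(f"{bpw} bits/weight, lower bound {floor_gb:.1f} GB")
```

Under these assumptions the floor is about 7.3 GB at 4.5 bits per weight. The listed ~9 GB is higher, presumably because the skipped layers stay at bf16.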
## Quantization details
- **Method:** `mlx.nn.quantize(bits=4, group_size=64)`, following the MLX-LM convention
- **What's quantized:** every `nn.Linear` in the 48-block DiT except the
  skip patterns below
- **Skip patterns** (kept at bf16):
  - `final_layer.linear`: Meituan's documented skip
  - `t_embedder.`: TimestepEmbedder MLP (small + sensitive; it feeds
    `adaLN_modulation`, which quantization error would otherwise corrupt; see L42 in
    [skill-lessons.md](https://github.com/xocialize/longcat-video-mlx/blob/main/docs/development/skill-lessons.md))
  - `y_embedder.`: CaptionEmbedder MLP (small + sensitive)
  - `adaLN_modulation.`: per-block AdaLN-Zero modulation (**must stay
    floating-point**; quantizing it causes a silent accumulation bug, see L11)
- **What's NOT quantized:** VAE, umT5, both LoRAs. They're small
  contributors to total disk, and quantizing them would degrade output
  more than it would save space.
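
The skip list above can be expressed as a predicate for the `class_predicate` argument of `mlx.nn.quantize`. This is a minimal sketch, not the package's actual code: it assumes the patterns are matched as substrings of the dotted module path, and the example paths in the comments are hypothetical.

```python
# Skip patterns copied from the list above; substring matching is an assumption.
SKIP_PATTERNS = (
    "final_layer.linear",
    "t_embedder.",
    "y_embedder.",
    "adaLN_modulation.",
)


def should_quantize(path: str) -> bool:
    """True if no skip pattern occurs in the dotted module path."""
    return not any(pattern in path for pattern in SKIP_PATTERNS)


def class_predicate(path: str, module) -> bool:
    """Predicate in the (path, module) form that mlx.nn.quantize expects.

    Only modules exposing `to_quantized` (e.g. nn.Linear) are candidates.
    """
    return hasattr(module, "to_quantized") and should_quantize(path)


# Intended use on a Mac with MLX installed (`dit` is a hypothetical model object):
#   import mlx.nn as nn
#   nn.quantize(dit, group_size=64, bits=4, class_predicate=class_predicate)
```

With this matching rule, a path such as `blocks.0.adaLN_modulation.1` stays at bf16, while a hypothetical `blocks.0.attn.qkv` is quantized.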
The runtime pipeline (`longcat_video` package) auto-detects the
`quantization` block in `dit/config.json` and applies `nn.quantize`
*before* `load_weights`. No user-facing API change vs. the bf16 variant.
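
The ordering matters because a quantized checkpoint stores packed weights plus per-group scales and biases, so the module tree has to contain quantized layers before the tensors are loaded. A hedged sketch of that flow follows. The `bits` / `group_size` keys follow the MLX-LM convention, while the helper names, the `*.safetensors` shard glob, and the directory layout are assumptions rather than the `longcat_video` implementation.

```python
import json
from pathlib import Path


def read_quantization(config: dict):
    """Return nn.quantize keyword arguments, or None for an unquantized checkpoint."""
    quant = config.get("quantization")
    if not quant:
        return None
    return {"bits": int(quant["bits"]), "group_size": int(quant["group_size"])}


def load_dit(model, weights_dir, class_predicate=None):
    """Quantize `model` in place if the config asks for it, then load the shards.

    `model` is a hypothetical, already constructed bf16 DiT (an mlx.nn.Module).
    """
    import mlx.core as mx  # imported lazily so the helper above works without MLX
    import mlx.nn as nn

    dit_dir = Path(weights_dir) / "dit"
    config = json.loads((dit_dir / "config.json").read_text())

    kwargs = read_quantization(config)
    if kwargs is not None:
        # Swap eligible layers for quantized ones BEFORE loading tensors.
        nn.quantize(model, class_predicate=class_predicate, **kwargs)

    # Merge all shards into one mapping, then load them in a single call.
    weights = {}
    for shard in sorted(dit_dir.glob("*.safetensors")):
        weights.update(mx.load(str(shard)))
    model.load_weights(list(weights.items()))
    return model
```

A bf16 checkpoint without a `quantization` block takes the same path and simply skips the `nn.quantize` call, which is consistent with there being no user-facing API difference.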
## Quick start
```bash
# 1. Pull weights (~25 GB)
hf download mlx-community/LongCat-Video-q4 --local-dir ./weights
# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"
# 3. Run text-to-video (pass --variant q4)
.venv/bin/python scripts/run_t2v.py \
--weights ./weights/.. \
--variant q4 \
--prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
--num-frames 93 \
--out output_t2v.mp4
# 4. Fast mode: --variant q4 --cfg-step-lora reduces 50 steps → 8
.venv/bin/python scripts/run_t2v.py \
--weights ./weights/.. \
--variant q4 --cfg-step-lora \
--prompt "A cat surfing on a wave at sunset..." \
--num-frames 93 \
--out output_t2v_fast.mp4
```
## Choosing between bf16, q4, q8
| Variant | Disk | Min RAM | Quality | Pick when |
|---|---|---|---|---|
| **bf16** | 42 GB | 64 GB | reference | You want the best output and have the RAM headroom |
| **q4** | 25 GB | 32 GB | minor degradation | RAM is tight; you'd rather have q4 than not run at all |
| **q8** | 30 GB | 48 GB | very close to bf16 | Best of both: small disk savings, near-bf16 quality |
For batch generation / API serving, **bf16 is the right choice**, because
quality regression compounds. For exploration / personal use on a
32–64 GB Mac, **q4 is the sweet spot**.
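
For scripting the choice, the "Min RAM" column can be turned into a small lookup. This sketch uses one possible reading of the table (pick the highest-quality variant whose minimum fits); the function names are hypothetical, and it deliberately ignores the softer advice above that q4 can still be the better pick for personal use.

```python
import subprocess
from typing import Optional

# (variant, minimum unified memory in GB), ordered from highest to lowest quality,
# taken from the table above.
VARIANTS = (("bf16", 64), ("q8", 48), ("q4", 32))


def pick_variant(ram_gb: float) -> Optional[str]:
    """Return the highest-quality variant whose minimum RAM fits, else None."""
    for name, min_ram_gb in VARIANTS:
        if ram_gb >= min_ram_gb:
            return name
    return None


def mac_ram_gb() -> float:
    """Total physical memory in GiB on macOS, read via `sysctl -n hw.memsize`."""
    result = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout) / 1024**3
```

On a Mac, `pick_variant(mac_ram_gb())` gives a starting point; the result can then be passed to `--variant`.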
## License
MIT, matching the upstream
[LongCat-Video](https://github.com/meituan-longcat/LongCat-Video) license.