Part of the LongCat-Video - MLX collection.
LongCat-Video-q4 (MLX)
4-bit quantized variant of mlx-community/LongCat-Video-bf16.
Same model, same six task variants (T2V / I2V / Continuation / Refinement / Long-Video / Interactive),
same cfg_step_lora + refinement_lora files; the only difference is that the DiT Linear layers are
quantized to 4-bit via mlx.nn.quantize for smaller-RAM Macs.
TL;DR
| Property | Value |
|---|---|
| DiT | 4-bit quantized (group_size=64, skip final_layer.linear + embedders + AdaLN) |
| DiT size | ~9 GB (2 shards; 2.85× smaller than bf16's 26 GB) |
| VAE / umT5 / LoRAs | bf16 (unchanged from bf16-variant) |
| Total disk | ~25 GB (vs 42 GB bf16) |
| Min unified memory | ~32 GB recommended for 480p |
| Inference | 50-step baseline OR 8-step with cfg_step_lora (fast) |
| License | MIT |
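The DiT size ratio in the table can be sanity-checked with a little arithmetic. The sketch below assumes MLX's affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights (an assumption about the storage layout, not something this card states). The helper names are hypothetical.

```python
def bits_per_weight(bits: int = 4, group_size: int = 64, meta_bits: int = 16) -> float:
    """Effective storage per weight: payload bits plus a per-group scale and bias."""
    return bits + 2 * meta_bits / group_size


def quantized_fraction(observed_ratio: float, bpw: float, full_bits: float = 16.0) -> float:
    """Fraction f of weights quantized, solved from
    observed_ratio = full_bits / (f * bpw + (1 - f) * full_bits)."""
    return full_bits * (1 - 1 / observed_ratio) / (full_bits - bpw)


bpw = bits_per_weight()                  # 4 + 32/64 = 4.5 bits per weight
ideal_ratio = 16.0 / bpw                 # ratio if every weight were quantized
fraction = quantized_fraction(2.85, bpw)  # implied by the 2.85x figure above

print(f"bits per weight: {bpw}")
print(f"ideal compression vs bf16: {ideal_ratio:.2f}x")
print(f"implied quantized fraction: {fraction:.2f}")
```

Under these assumptions, the reported 2.85× falls below the all-quantized ideal because the skipped layers stay at bf16. The implied fraction is only an estimate, since the ~9 GB and 26 GB figures are rounded.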
Quantization details
- Method: mlx.nn.quantize(bits=4, group_size=64) (MLX-LM convention)
- What's quantized: every nn.Linear in the 48-block DiT EXCEPT the skip patterns below
- Skip patterns (kept at bf16):
  - final_layer.linear: Meituan's documented skip
  - t_embedder.: TimestepEmbedder MLP (small + sensitive; feeds adaLN_modulation, which would otherwise corrupt; see L42 in skill-lessons.md)
  - y_embedder.: CaptionEmbedder MLP (small + sensitive)
  - adaLN_modulation.: per-block AdaLN-Zero modulation (must stay floating-point; silent accumulation bug if quantized, L11)
- What's NOT quantized: VAE, umT5, both LoRAs. They're small contributors to total disk, and quantizing them would degrade output more than save space.
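The skip list above can be expressed as a predicate passed to mlx.nn.quantize through its class_predicate argument. This is a minimal sketch, not the conversion script used for this repo: matching skip patterns as substrings of the module path is an assumption, and should_quantize and quantize_dit are hypothetical names.

```python
# Patterns copied from the skip list above; substring matching is an assumption.
SKIP_PATTERNS = (
    "final_layer.linear",
    "t_embedder.",
    "y_embedder.",
    "adaLN_modulation.",
)


def should_quantize(path: str) -> bool:
    """Return True if the module at `path` is not covered by a skip pattern."""
    return not any(pattern in path for pattern in SKIP_PATTERNS)


def quantize_dit(dit) -> None:
    """Quantize eligible nn.Linear layers of `dit` in place (requires MLX)."""
    import mlx.nn as nn  # imported lazily so the predicate is usable without MLX

    nn.quantize(
        dit,
        group_size=64,
        bits=4,
        # Only nn.Linear modules outside the skip list are converted.
        class_predicate=lambda path, module: isinstance(module, nn.Linear)
        and should_quantize(path),
    )
```

The module paths this predicate would see (for example blocks.0.adaLN_modulation.1) depend on how the DiT is defined in the longcat_video package, so the patterns should be checked against the real parameter tree before relying on them.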
The runtime pipeline (longcat_video package) auto-detects the
quantization block in dit/config.json and applies nn.quantize
before load_weights. No user-facing API change vs. the bf16 variant.
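The auto-detection step could look roughly like the sketch below. It assumes the quantization block lives under a "quantization" key with "bits" and "group_size" fields, as in the MLX-LM convention. The key name and the read_quantization helper are assumptions, not the package's actual code.

```python
import json
from pathlib import Path
from typing import Optional


def read_quantization(model_dir: str) -> Optional[dict]:
    """Return {'bits', 'group_size'} from dit/config.json, or None for bf16 weights."""
    config_path = Path(model_dir) / "dit" / "config.json"
    with config_path.open() as f:
        config = json.load(f)

    quant = config.get("quantization")  # assumed key name (MLX-LM style)
    if quant is None:
        return None
    return {"bits": int(quant["bits"]), "group_size": int(quant["group_size"])}
```

A loader would call this before loading weights: if it returns a dict, the module tree is quantized with those settings first so that the parameter names and shapes match the quantized checkpoint.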
Quick start
# 1. Pull weights (~25 GB)
hf download mlx-community/LongCat-Video-q4 --local-dir ./weights
# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"
# 3. Run text-to-video (pass --variant q4)
.venv/bin/python scripts/run_t2v.py \
--weights ./weights/.. \
--variant q4 \
--prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
--num-frames 93 \
--out output_t2v.mp4
# 4. Fast mode: --variant q4 --cfg-step-lora reduces 50 steps → 8
.venv/bin/python scripts/run_t2v.py \
--weights ./weights/.. \
--variant q4 --cfg-step-lora \
--prompt "A cat surfing on a wave at sunset..." \
--num-frames 93 \
--out output_t2v_fast.mp4
Choosing between bf16, q4, q8
| Variant | Disk | Min RAM | Quality | Pick when |
|---|---|---|---|---|
| bf16 | 42 GB | 64 GB | reference | You want the best output and have the RAM headroom |
| q4 | 25 GB | 32 GB | minor degradation | RAM is tight; you'd rather have q4 than not run at all |
| q8 | 30 GB | 48 GB | very close to bf16 | Best of both: small disk savings, near-bf16 quality |
For batch generation / API serving, bf16 is the right choice, because quality regression compounds. For exploration / personal use on a 32 to 64 GB Mac, q4 is the sweet spot.
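The Min RAM column of the table can be turned into a simple selection rule. The sketch below picks the highest-quality variant whose minimum fits the available unified memory. It uses only the table's thresholds and ignores the workload guidance above, and pick_variant is a hypothetical helper.

```python
from typing import Optional

# (variant, minimum unified memory in GB), ordered from highest to lowest quality.
VARIANTS = (("bf16", 64), ("q8", 48), ("q4", 32))


def pick_variant(ram_gb: float) -> Optional[str]:
    """Return the first variant whose minimum RAM fits, or None if none do."""
    for name, min_ram_gb in VARIANTS:
        if ram_gb >= min_ram_gb:
            return name
    return None


for ram in (24, 32, 48, 64, 128):
    print(ram, "GB ->", pick_variant(ram))
```

The returned name can then be passed as the --variant argument shown in the quick start, provided the matching weights have been downloaded.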
License
MIT, matching the upstream LongCat-Video license.
Base model: meituan-longcat/LongCat-Video