Part of the LongCat-Video — MLX collection.

LongCat-Video-q4 (MLX)

A 4-bit quantized variant of mlx-community/LongCat-Video-bf16. It is the same model, with the same six task variants (T2V / I2V / Continuation / Refinement / Long-Video / Interactive) and the same cfg_step_lora and refinement_lora files. The only difference is that the DiT's nn.Linear layers are quantized to 4-bit via mlx.nn.quantize, so the model fits on Macs with less unified memory.

TL;DR


| | |
|---|---|
| DiT | 4-bit quantized (`group_size=64`, skip `final_layer.linear` + embedders + AdaLN) |
| DiT size | ~9 GB (2 shards; 2.85× smaller than bf16's 26 GB) |
| VAE / umT5 / LoRAs | bf16 (unchanged from bf16-variant) |
| Total disk | ~25 GB (vs 42 GB bf16) |
| Min unified memory | ~32 GB recommended for 480p |
| Inference | 50-step baseline OR 8-step with `cfg_step_lora` (fast) |
| License | MIT |
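
As a rough sanity check on the DiT size, the storage cost of group-wise 4-bit quantization can be estimated by hand. The sketch below assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the packed 4-bit values; that overhead is an assumption about the on-disk layout, not something measured from this checkpoint.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits stored per weight: packed value plus its share of the group's scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def quantized_size_gb(dense_gb: float, dense_bits: int, bits: int, group_size: int) -> float:
    """Size if every weight of a dense checkpoint were quantized (a lower bound)."""
    return dense_gb * effective_bits_per_weight(bits, group_size) / dense_bits


bpw = effective_bits_per_weight(bits=4, group_size=64)
floor_gb = quantized_size_gb(dense_gb=26.0, dense_bits=16, bits=4, group_size=64)

print(f"effective bits per weight: {bpw}")
print(f"size if all 26 GB of bf16 weights were quantized: {floor_gb:.2f} GB")
```

Under these assumptions the floor is a little over 7 GB. The ~9 GB figure above sits higher than that, which is what one would expect given that the skipped layers stay in bf16.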

Quantization details

Method: mlx.nn.quantize(bits=4, group_size=64) — MLX-LM convention
What's quantized: every nn.Linear in the 48-block DiT EXCEPT the skip patterns below
Skip patterns (kept at bf16):
- `final_layer.linear`: Meituan's documented skip
- `t_embedder.`: TimestepEmbedder MLP (small and sensitive; its output feeds adaLN_modulation, so quantization error here would corrupt the modulation; see L42 in skill-lessons.md)
- `y_embedder.`: CaptionEmbedder MLP (small and sensitive)
- `adaLN_modulation.`: per-block AdaLN-Zero modulation (must stay floating-point; quantizing it causes a silent accumulation bug, L11)
What's NOT quantized: VAE, umT5, and both LoRAs. They contribute little to total disk size, and quantizing them would degrade output more than it would save space.
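
The skip patterns above can be expressed as a predicate of the kind `mlx.nn.quantize` accepts through its `class_predicate` argument. This is a minimal sketch, not the converter's actual code: substring matching on the module path is an assumption, and the layer paths in the demo (`blocks.0.attn.qkv` and so on) are hypothetical names used only for illustration.

```python
# Path fragments of layers that stay in bf16 (taken from the list above).
SKIP_PATTERNS = (
    "final_layer.linear",
    "t_embedder.",
    "y_embedder.",
    "adaLN_modulation.",
)


def should_quantize(path: str, module: object) -> bool:
    """Return True if the layer at `path` should be quantized."""
    if any(pattern in path for pattern in SKIP_PATTERNS):
        return False
    # Quantizable MLX layers such as nn.Linear expose `to_quantized`;
    # duck-typing keeps this sketch runnable without MLX installed.
    return hasattr(module, "to_quantized")


class FakeLinear:
    """Stand-in for mlx.nn.Linear, used only for this demo."""

    def to_quantized(self, group_size: int = 64, bits: int = 4):
        raise NotImplementedError


for path in ("blocks.0.attn.qkv", "blocks.0.adaLN_modulation.1", "final_layer.linear"):
    print(path, "->", "quantize" if should_quantize(path, FakeLinear()) else "keep bf16")

# With MLX available, the predicate would be passed along these lines:
#   import mlx.nn as nn
#   nn.quantize(dit, group_size=64, bits=4, class_predicate=should_quantize)
```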

The runtime pipeline (longcat_video package) auto-detects the quantization block in dit/config.json and applies nn.quantize before load_weights. No user-facing API change vs. the bf16 variant.
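
The detection step might look roughly like the following. This is a sketch under assumptions rather than the longcat_video implementation: the key names (`quantization`, `bits`, `group_size`) follow the MLX-LM convention mentioned above, and `read_quantization` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional


def read_quantization(config_path: str) -> Optional[dict]:
    """Return {'bits', 'group_size'} if the config declares quantization, else None."""
    config = json.loads(Path(config_path).read_text())
    quant = config.get("quantization")  # assumed key name (MLX-LM convention)
    if quant is None:
        return None  # bf16 checkpoint: load weights directly
    return {"bits": int(quant["bits"]), "group_size": int(quant["group_size"])}


# Intended order of operations (shown as comments because they need MLX and the model):
#   quant = read_quantization("weights/dit/config.json")
#   if quant is not None:
#       nn.quantize(dit, class_predicate=..., **quant)  # swap in quantized layers first
#   dit.load_weights(...)                               # then load the packed weights
```

The order matters because a quantized checkpoint stores packed weights plus per-group scales and biases, so the module tree needs to contain the quantized layers before those tensors can be loaded into it.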

Quick start

# 1. Pull weights (~25 GB)
hf download mlx-community/LongCat-Video-q4 --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video — pass --variant q4
.venv/bin/python scripts/run_t2v.py \
    --weights ./weights/.. \
    --variant q4 \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --num-frames 93 \
    --out output_t2v.mp4

# 4. Fast mode: --variant q4 --cfg-step-lora reduces 50 steps → 8
.venv/bin/python scripts/run_t2v.py \
    --weights ./weights/.. \
    --variant q4 --cfg-step-lora \
    --prompt "A cat surfing on a wave at sunset..." \
    --num-frames 93 \
    --out output_t2v_fast.mp4
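
For scripting several runs, the same command line can be assembled from Python. This is only a convenience sketch: `build_t2v_command` is a hypothetical helper, and it uses nothing beyond the flags shown in steps 3 and 4.

```python
import shlex


def build_t2v_command(weights_dir: str, prompt: str, out_path: str,
                      num_frames: int = 93, fast: bool = False,
                      python: str = ".venv/bin/python") -> list:
    """Assemble the run_t2v.py invocation from the quick start as an argv list."""
    cmd = [python, "scripts/run_t2v.py", "--weights", weights_dir, "--variant", "q4"]
    if fast:
        cmd.append("--cfg-step-lora")  # 8-step mode instead of the 50-step baseline
    cmd += ["--prompt", prompt, "--num-frames", str(num_frames), "--out", out_path]
    return cmd


cmd = build_t2v_command(
    "./weights/..",
    "A cat surfing on a wave at sunset, cinematic, 8k",
    "output_t2v_fast.mp4",
    fast=True,
)
print(shlex.join(cmd))

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```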

Choosing between bf16, q4, q8

| Variant | Disk | Min RAM | Quality | Pick when |
|---|---|---|---|---|
| bf16 | 42 GB | 64 GB | reference | You want the best output and have the RAM headroom |
| q4 | 25 GB | 32 GB | minor degradation | RAM is tight; you'd rather have q4 than not run at all |
| q8 | 30 GB | 48 GB | very close to bf16 | Best of both: small disk savings, near-bf16 quality |

For batch generation / API serving, bf16 is the right choice — quality regression compounds. For exploration / personal use on a 32–64 GB Mac, q4 is the sweet spot.
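
The table can be turned into a small selection rule. The sketch below is one possible reading of it, not an official recommendation: `pick_variant` is a hypothetical helper that returns the highest-quality variant whose "Min RAM" figure fits the machine.

```python
# (variant, min unified memory in GB, disk in GB), ordered from best quality down.
VARIANTS = [
    ("bf16", 64, 42),
    ("q8", 48, 30),
    ("q4", 32, 25),
]


def pick_variant(ram_gb: float):
    """Return the highest-quality variant that fits in `ram_gb`, or None if none does."""
    for name, min_ram_gb, _disk_gb in VARIANTS:
        if ram_gb >= min_ram_gb:
            return name
    return None


for ram in (16, 32, 48, 64, 128):
    print(f"{ram} GB -> {pick_variant(ram)}")
```

This rule is quality-first. If you want headroom for other applications while generating, stepping down one variant from what it returns is a reasonable adjustment.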