--- license: mit library_name: mlx pipeline_tag: text-to-video tags: - mlx - apple-silicon - video-generation - text-to-video - image-to-video - video-continuation - longcat - flow-matching - block-sparse-attention - quantized - 4-bit base_model: - mlx-community/LongCat-Video-bf16 language: - en - zh --- Part of the [LongCat-Video — MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx-6a216a3576c098e83c1cc167) collection. # LongCat-Video-q4 (MLX) 4-bit quantized variant of [mlx-community/LongCat-Video-bf16](https://huggingface.co/mlx-community/LongCat-Video-bf16). Same model, same six task variants (T2V / I2V / Continuation / Refinement / Long-Video / Interactive), same `cfg_step_lora` + `refinement_lora` files — just with the DiT Linears quantized to 4-bit via `mlx.nn.quantize` for smaller-RAM Macs. ## TL;DR | | | |---|---| | **DiT** | 4-bit quantized (`group_size=64`, skip `final_layer.linear` + embedders + AdaLN) | | **DiT size** | ~9 GB (2 shards; 2.85× smaller than bf16's 26 GB) | | **VAE / umT5 / LoRAs** | bf16 (unchanged from bf16-variant) | | **Total disk** | ~25 GB (vs 42 GB bf16) | | **Min unified memory** | ~32 GB recommended for 480p | | **Inference** | 50-step baseline OR 8-step with `cfg_step_lora` (fast) | | **License** | MIT | ## Quantization details - **Method:** `mlx.nn.quantize(bits=4, group_size=64)` — MLX-LM convention - **What's quantized:** every `nn.Linear` in the 48-block DiT EXCEPT the skip patterns below - **Skip patterns** (kept at bf16): - `final_layer.linear` — Meituan's documented skip - `t_embedder.` — TimestepEmbedder MLP (small + sensitive; feeds `adaLN_modulation` which would otherwise corrupt — see L42 in [skill-lessons.md](https://github.com/xocialize/longcat-video-mlx/blob/main/docs/development/skill-lessons.md)) - `y_embedder.` — CaptionEmbedder MLP (small + sensitive) - `adaLN_modulation.` — per-block AdaLN-Zero modulation (**must stay floating-point** — silent accumulation bug if quantized, L11) - **What's NOT quantized:** VAE, umT5, both LoRAs — they're small contributors to total disk and quantizing them would degrade output more than save space. The runtime pipeline (`longcat_video` package) auto-detects the `quantization` block in `dit/config.json` and applies `nn.quantize` *before* `load_weights`. No user-facing API change vs. the bf16 variant. ## Quick start ```bash # 1. Pull weights (~25 GB) hf download mlx-community/LongCat-Video-q4 --local-dir ./weights # 2. Set up inference (Python 3.12) git clone https://github.com/xocialize/longcat-video-mlx cd longcat-video-mlx python3.12 -m venv .venv .venv/bin/pip install -e ".[parity]" # 3. Run text-to-video — pass --variant q4 .venv/bin/python scripts/run_t2v.py \ --weights ./weights/.. \ --variant q4 \ --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \ --num-frames 93 \ --out output_t2v.mp4 # 4. Fast mode: --variant q4 --cfg-step-lora reduces 50 steps → 8 .venv/bin/python scripts/run_t2v.py \ --weights ./weights/.. \ --variant q4 --cfg-step-lora \ --prompt "A cat surfing on a wave at sunset..." \ --num-frames 93 \ --out output_t2v_fast.mp4 ``` ## Choosing between bf16, q4, q8 | Variant | Disk | Min RAM | Quality | Pick when | |---|---|---|---|---| | **bf16** | 42 GB | 64 GB | reference | You want the best output and have the RAM headroom | | **q4** | 25 GB | 32 GB | minor degradation | RAM is tight; you'd rather have q4 than not run at all | | **q8** | 30 GB | 48 GB | very close to bf16 | Best of both — small disk savings, near-bf16 quality | For batch generation / API serving, **bf16 is the right choice** — quality regression compounds. For exploration / personal use on a 32–64 GB Mac, **q4 is the sweet spot**. ## License MIT — matches the upstream [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video) license.