Upload folder using huggingface_hub

5889ff6 verified about 19 hours ago

3.96 kB

	---
	license: mit
	library_name: mlx
	pipeline_tag: text-to-video
	tags:
	- mlx
	- apple-silicon
	- video-generation
	- text-to-video
	- image-to-video
	- video-continuation
	- longcat
	- flow-matching
	- block-sparse-attention
	- quantized
	- 4-bit
	base_model:
	- mlx-community/LongCat-Video-bf16
	language:
	- en
	- zh
	---

	Part of the [LongCat-Video — MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx-6a216a3576c098e83c1cc167) collection.


	# LongCat-Video-q4 (MLX)

	4-bit quantized variant of [mlx-community/LongCat-Video-bf16](https://huggingface.co/mlx-community/LongCat-Video-bf16).
	Same model, same six task variants (T2V / I2V / Continuation / Refinement / Long-Video / Interactive),
	same `cfg_step_lora` + `refinement_lora` files — just with the DiT Linears
	quantized to 4-bit via `mlx.nn.quantize` for smaller-RAM Macs.

	## TL;DR

	\| \| \|
	\|---\|---\|
	\| DiT \| 4-bit quantized (`group_size=64`, skip `final_layer.linear` + embedders + AdaLN) \|
	\| DiT size \| ~9 GB (2 shards; 2.85× smaller than bf16's 26 GB) \|
	\| VAE / umT5 / LoRAs \| bf16 (unchanged from bf16-variant) \|
	\| Total disk \| ~25 GB (vs 42 GB bf16) \|
	\| Min unified memory \| ~32 GB recommended for 480p \|
	\| Inference \| 50-step baseline OR 8-step with `cfg_step_lora` (fast) \|
	\| License \| MIT \|

	## Quantization details

	- Method: `mlx.nn.quantize(bits=4, group_size=64)` — MLX-LM convention
	- What's quantized: every `nn.Linear` in the 48-block DiT EXCEPT the
	skip patterns below
	- Skip patterns (kept at bf16):
	- `final_layer.linear` — Meituan's documented skip
	- `t_embedder.` — TimestepEmbedder MLP (small + sensitive; feeds
	`adaLN_modulation` which would otherwise corrupt — see L42 in
	[skill-lessons.md](https://github.com/xocialize/longcat-video-mlx/blob/main/docs/development/skill-lessons.md))
	- `y_embedder.` — CaptionEmbedder MLP (small + sensitive)
	- `adaLN_modulation.` — per-block AdaLN-Zero modulation (**must stay
	floating-point** — silent accumulation bug if quantized, L11)
	- What's NOT quantized: VAE, umT5, both LoRAs — they're small
	contributors to total disk and quantizing them would degrade output
	more than save space.

	The runtime pipeline (`longcat_video` package) auto-detects the
	`quantization` block in `dit/config.json` and applies `nn.quantize`
	before `load_weights`. No user-facing API change vs. the bf16 variant.

	## Quick start

	```bash
	# 1. Pull weights (~25 GB)
	hf download mlx-community/LongCat-Video-q4 --local-dir ./weights

	# 2. Set up inference (Python 3.12)
	git clone https://github.com/xocialize/longcat-video-mlx
	cd longcat-video-mlx
	python3.12 -m venv .venv
	.venv/bin/pip install -e ".[parity]"

	# 3. Run text-to-video — pass --variant q4
	.venv/bin/python scripts/run_t2v.py \
	--weights ./weights/.. \
	--variant q4 \
	--prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
	--num-frames 93 \
	--out output_t2v.mp4

	# 4. Fast mode: --variant q4 --cfg-step-lora reduces 50 steps → 8
	.venv/bin/python scripts/run_t2v.py \
	--weights ./weights/.. \
	--variant q4 --cfg-step-lora \
	--prompt "A cat surfing on a wave at sunset..." \
	--num-frames 93 \
	--out output_t2v_fast.mp4
	```

	## Choosing between bf16, q4, q8

	\| Variant \| Disk \| Min RAM \| Quality \| Pick when \|
	\|---\|---\|---\|---\|---\|
	\| bf16 \| 42 GB \| 64 GB \| reference \| You want the best output and have the RAM headroom \|
	\| q4 \| 25 GB \| 32 GB \| minor degradation \| RAM is tight; you'd rather have q4 than not run at all \|
	\| q8 \| 30 GB \| 48 GB \| very close to bf16 \| Best of both — small disk savings, near-bf16 quality \|

	For batch generation / API serving, bf16 is the right choice —
	quality regression compounds. For exploration / personal use on a
	32–64 GB Mac, q4 is the sweet spot.

	## License

	MIT — matches the upstream
	[LongCat-Video](https://github.com/meituan-longcat/LongCat-Video) license.