	---
	license: mit
	library_name: mlx
	pipeline_tag: text-to-video
	tags:
	- mlx
	- apple-silicon
	- video-generation
	- text-to-video
	- image-to-video
	- video-continuation
	- longcat
	- flow-matching
	- block-sparse-attention
	- quantized
	- 8-bit
	base_model:
	- mlx-community/LongCat-Video-bf16
	language:
	- en
	- zh
	---

	Part of the [LongCat-Video — MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx-6a216a3576c098e83c1cc167) collection.


	# LongCat-Video-q8 (MLX)

	8-bit quantized variant of [mlx-community/LongCat-Video-bf16](https://huggingface.co/mlx-community/LongCat-Video-bf16).
	Same model, same six task variants (T2V / I2V / Continuation / Refinement / Long-Video / Interactive),
	same `cfg_step_lora` + `refinement_lora` files. The only change is that the DiT `Linear` layers
	are quantized to 8-bit via `mlx.nn.quantize`.

	Compared with q4, the 8-bit variant gives up some disk savings in exchange for
	near-bf16 quality. If you have the RAM headroom for ~31 GB of weights but not 42 GB,
	q8 is the right pick.

	## TL;DR

	| | |
	|---|---|
	| DiT | 8-bit quantized (`group_size=64`, skip `final_layer.linear` + embedders + AdaLN) |
	| DiT size | ~15 GB (4 shards; 1.7× smaller than bf16's 26 GB) |
	| VAE / umT5 / LoRAs | bf16 (unchanged from bf16-variant) |
	| Total disk | ~31 GB (vs 42 GB bf16) |
	| Min unified memory | ~48 GB recommended for 480p |
	| Inference | 50-step baseline OR 8-step with `cfg_step_lora` (fast) |
	| License | MIT |
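The size figures in the table can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above. The 8.5 bits-per-weight estimate is an assumption not stated on this card: it supposes that each group of 64 weights carries one 16-bit scale and one 16-bit bias alongside the 8-bit values.

```python
# Figures quoted in the table above, in GB.
BF16_TOTAL_GB = 42
BF16_DIT_GB = 26
Q8_DIT_GB = 15

# Everything that is not the DiT (VAE, umT5, LoRAs) stays in bf16.
shared_gb = BF16_TOTAL_GB - BF16_DIT_GB
q8_total_gb = shared_gb + Q8_DIT_GB
dit_ratio = BF16_DIT_GB / Q8_DIT_GB

# Assumption: one 16-bit scale and one 16-bit bias per group of 64 weights.
GROUP_SIZE = 64
BITS = 8
bits_per_weight = BITS + (16 + 16) / GROUP_SIZE
ideal_ratio = 16 / bits_per_weight

print(f"non-DiT components: {shared_gb} GB")
print(f"q8 total: {q8_total_gb} GB")
print(f"DiT shrink factor: {dit_ratio:.2f}x")
print(f"effective bits per quantized weight: {bits_per_weight}")
print(f"ideal shrink if every layer were quantized: {ideal_ratio:.2f}x")
```

Under that assumption the ideal shrink factor would be about 1.88×, so the quoted 1.7× is plausibly explained by the layers that are deliberately left in bf16.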

	## Quantization details

	Same skip pattern as q4 — see the q4 card for full notes on why each
	pattern is excluded (L11 + L42 in the
	[skill-lessons](https://github.com/xocialize/longcat-video-mlx/blob/main/docs/development/skill-lessons.md)).

	The only difference vs q4 is `bits=8` in the `quantization` config block.
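As a rough illustration of how such a skip pattern could be expressed, the sketch below defines a predicate suitable for the `class_predicate` argument of `mlx.nn.quantize`. The substrings in `SKIP_SUBSTRINGS` and the helper names `should_quantize` and `quantize_dit` are hypothetical: the card names the skipped groups but not the exact module paths, so treat this as a sketch under those assumptions rather than the conversion script actually used.

```python
from typing import Any

# Hypothetical path fragments for the three skipped groups named on this card.
SKIP_SUBSTRINGS = ("final_layer.linear", "embedder", "adaln")


def should_quantize(path: str, module: Any = None) -> bool:
    """Return True if the layer at `path` should be quantized."""
    lowered = path.lower()
    if any(fragment in lowered for fragment in SKIP_SUBSTRINGS):
        return False
    # Only modules that expose `to_quantized` can be quantized by MLX.
    # `module=None` lets the path logic be checked without MLX installed.
    return module is None or hasattr(module, "to_quantized")


def quantize_dit(dit):
    """Quantize a DiT module in place with the settings listed on this card."""
    import mlx.nn as nn  # imported lazily so the predicate works without MLX

    nn.quantize(dit, group_size=64, bits=8, class_predicate=should_quantize)
    return dit


for name in ("blocks.0.attn.qkv", "final_layer.linear", "blocks.3.adaLN_modulation.1"):
    print(name, should_quantize(name))
```

Switching between q4 and q8 in this sketch would only mean changing the `bits` argument, which matches the single-setting difference described above.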

	## Quick start

	```bash
	# 1. Pull weights (~31 GB)
	hf download mlx-community/LongCat-Video-q8 --local-dir ./weights

	# 2. Set up inference
	git clone https://github.com/xocialize/longcat-video-mlx
	cd longcat-video-mlx
	python3.12 -m venv .venv
	.venv/bin/pip install -e ".[parity]"

	# 3. Run text-to-video — pass --variant q8
	.venv/bin/python scripts/run_t2v.py \
	  --weights ./weights/.. \
	  --variant q8 \
	  --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
	  --num-frames 93 \
	  --out output_t2v.mp4
	```
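To render several prompts in one go, the command from step 3 can be assembled programmatically. The sketch below only uses the flags shown in the quick start. The helper name `build_t2v_command` and the example prompts are illustrative assumptions, and the loop prints the commands instead of running them.

```python
import shlex


def build_t2v_command(weights: str, prompt: str, out: str,
                      variant: str = "q8", num_frames: int = 93) -> list[str]:
    """Assemble the argv for scripts/run_t2v.py with the quick-start flags."""
    return [
        ".venv/bin/python", "scripts/run_t2v.py",
        "--weights", weights,
        "--variant", variant,
        "--prompt", prompt,
        "--num-frames", str(num_frames),
        "--out", out,
    ]


# Example prompts (illustrative only), keyed by output file stem.
prompts = {
    "surfing_cat": "A cat surfing on a wave at sunset, cinematic, 8k",
    "rainy_street": "A rainy city street at night, neon reflections",
}

for stem, prompt in prompts.items():
    cmd = build_t2v_command("./weights/..", prompt, f"{stem}.mp4")
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
    # To execute from the repo root: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain commas, quotes, or spaces.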

	## Choosing between bf16, q4, q8

	| Variant | Disk | Min RAM | Quality | Pick when |
	|---|---|---|---|---|
	| bf16 | 42 GB | 64 GB | reference | Best output, you have the RAM headroom |
	| q4 | 25 GB | 32 GB | minor degradation | RAM is tight (32 GB Mac) |
	| q8 | 31 GB | 48 GB | very close to bf16 | Best balance: small savings, near-bf16 quality |
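The table can be read as a simple rule: take the highest-quality variant whose minimum RAM fits your machine. A minimal sketch of that rule follows, using the Min RAM column as thresholds. The function name `pick_variant` is hypothetical, and the thresholds are the card's recommendations, not hard limits.

```python
from typing import Optional

# (variant, minimum unified memory in GB), best quality first, from the table above.
VARIANTS = [("bf16", 64), ("q8", 48), ("q4", 32)]


def pick_variant(unified_memory_gb: float) -> Optional[str]:
    """Return the highest-quality variant that fits, or None if none does."""
    for name, min_ram_gb in VARIANTS:
        if unified_memory_gb >= min_ram_gb:
            return name
    return None


for ram_gb in (24, 32, 48, 64, 128):
    print(f"{ram_gb} GB -> {pick_variant(ram_gb)}")
```

On a Mac, the installed memory can be read with `sysctl -n hw.memsize` (reported in bytes) and divided by 1024³ before being passed in.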

	## License

	MIT — matches the upstream
	[LongCat-Video](https://github.com/meituan-longcat/LongCat-Video) license.