This model [TOTORONG/LongCat-Flash-3.5bits](https://huggingface.co/TOTORONG/LongCat-Flash-3.5bits) was converted to MLX format from [meituan-longcat/LongCat-Flash-Chat](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat) using mlx-lm version **0.27.1**.
## Quantization policy (by module)

| Module / Tensor name pattern | Bits | Notes |
|---|---|---|
| LayerNorms: `*layernorm*`, `input_layernorm`, `post_attention_layernorm` | fp16 (not quantized) | Kept full precision for stability; negligible size share. |
| Router: `mlp.router.classifier.*` | 8b | Conservative to preserve expert routing fidelity. |
| Embeddings: `embed_tokens.*` | 8b | Vocabulary quality & calibration. |
| LM head: `lm_head.*` | 8b | Output logits stability & calibration. |
| Self-Attention Q/K/V: `.self_attn.(q_a\|q_b\|kv_a(_with_mqa)?` | 3b → 4b on selected layers | |
| Self-Attention O-proj: `.self_attn.o_proj.weight` | 4b → 6b on selected layers | Higher precision on early/late/periodic layers to reduce accumulation error. |
| Switch-MLP experts: `.mlp.switch_mlp.(up\|gate\|down)_proj.weight` | 3b | Uniform across all layers. |
| Experts (per-block): `.mlps.<idx>.(up\|gate\|down)_proj.weight` | 2b → 3b on selected layers | |
| Everything else | low_bits fallback | Uses the converter's low_bits default if not matched above. |
## "Selected layers" (the precision bump mask)

A layer is considered early/late/periodic if its index `i` (from `model.layers.i`) satisfies any of:

- `i < num_layers // 8`, or
- `i >= 7 * num_layers // 8`, or
- `(i - num_layers // 8) % 3 == 2`

These layers receive:

- Q/K/V: 3b → 4b
- O-proj: 4b → 6b
- Experts (`.mlps.<idx>.*`): 2b → 3b

Switch-MLP remains 3b across all layers.

This mask preserves prompt sensitivity (front) and output stability (tail), with a periodic boost to reduce worst-case error accumulation.
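The three rules can be sketched as a single predicate (function name is illustrative; `num_layers` is the model's total layer count):

```python
def is_selected_layer(i: int, num_layers: int) -> bool:
    """Early/late/periodic mask from the rules above."""
    front = num_layers // 8
    return (
        i < front                      # early block (prompt sensitivity)
        or i >= 7 * num_layers // 8    # late block (output stability)
        or (i - front) % 3 == 2       # periodic boost in the middle
    )

# For a hypothetical 64-layer model: layers 0-7, 56-63, plus every
# third middle layer where (i - 8) % 3 == 2 (10, 13, 16, ...).
selected = [i for i in range(64) if is_selected_layer(i, 64)]
```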

## Use with mlx

```bash