# GLM-4.7-Flash: SWAN Mixed-Precision (4-bit avg)
This is GLM-4.7-Flash (MoE, 30B parameters) quantized using SWAN (Statistical Weight Analysis for N-bit allocation), a data-free per-tensor mixed-precision quantization method for MLX on Apple Silicon.
## Key Features
- Data-free quantization: no calibration dataset required; uses weight statistics only
- Per-tensor bit allocation: each tensor is assigned 2, 4, 8, or 16 bits based on sensitivity analysis
- MoE-aware: adaptive normalization preserves expert-layer precision
- MLX native: ready for inference on Apple Silicon via `mlx_lm`
## Results
| Metric | BF16 | SWAN (this model) | Uniform 4-bit |
|---|---|---|---|
| PPL median (WikiText-2) | 8.61 | 9.08 (+5.5%) | 11.46 (+33%) |
| Model size | 56 GB | 15.9 GB | 14.5 GB |
SWAN significantly outperforms uniform 4-bit quantization on this MoE model: median perplexity is only 5.5% above BF16, versus 33% for uniform 4-bit.
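As a sanity check on the size column, the effective average bits per weight can be computed from the reported sizes (a rough estimate that treats GB as 10^9 bytes and ignores embedding/norm tensors; the overhead above 4.0 bits comes from quantization scales and the tensors kept at higher precision):

```python
def bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average bits per parameter implied by an on-disk size."""
    return size_gb * 1e9 * 8 / n_params

N = 30e9  # 30B parameters
print(round(bits_per_weight(15.9, N), 2))  # SWAN: 4.24 bits/weight
print(round(bits_per_weight(14.5, N), 2))  # uniform 4-bit: 3.87 bits/weight
```

So the "4-bit avg" label corresponds to roughly 4.2 effective bits per weight for the mixed-precision model.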
## Usage
```shell
pip install mlx-lm

# Generate text
python -m mlx_lm.generate \
  --model baa-ai/GLM-4.7-Flash-SWAN-4bit \
  --prompt "Hello, how are you?"

# Interactive chat
python -m mlx_lm.chat --model baa-ai/GLM-4.7-Flash-SWAN-4bit
```
## Quantization Details
- Method: SWAN v3 (hybrid normalization: adaptive with selective fixed fallback)
- Base precision: 4-bit with selective 8-bit for shared expert layers and attention projections
- Architecture: Mixture-of-Experts with DeepSeek-style MLA (multi-head latent attention)
- Hardware: Quantized on Apple M2 Ultra 192GB
## About SWAN
SWAN computes four sensitivity metrics per tensor: SVD spectral concentration, excess kurtosis, output noise amplification, and reconstruction error proxy. These are combined into a composite score that drives automatic bit-width allocation, without any calibration data.
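The metric-to-bits mapping can be sketched as follows. This is an illustrative toy, not the released implementation: the weightings and thresholds are invented for the example, and only two of the four metrics (spectral concentration and excess kurtosis) are shown.

```python
import numpy as np

def excess_kurtosis(w: np.ndarray) -> float:
    # Heavy-tailed weight distributions quantize poorly -> more bits.
    z = (w.ravel() - w.mean()) / w.std()
    return float((z ** 4).mean() - 3.0)

def spectral_concentration(w: np.ndarray, k: int = 8) -> float:
    # Fraction of singular-value mass in the top-k directions.
    s = np.linalg.svd(w, compute_uv=False)
    return float(s[:k].sum() / s.sum())

def allocate_bits(w: np.ndarray) -> int:
    # Toy composite score; real SWAN combines four metrics with tuned weights.
    score = 0.5 * spectral_concentration(w) + 0.5 * min(excess_kurtosis(w) / 10.0, 1.0)
    if score > 0.6:
        return 8
    if score > 0.15:
        return 4
    return 2

rng = np.random.default_rng(0)
dense = rng.standard_normal((256, 256))        # featureless spectrum -> few bits
low_rank = np.outer(rng.standard_normal(256),  # concentrated spectrum -> more bits
                    rng.standard_normal(256))
print(allocate_bits(dense), allocate_bits(low_rank))
```

The point of the composite score is that no forward passes are needed: every input is a statistic of the weight tensor itself, which is what makes the method data-free.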
- Paper: SWAN: Data-Free Mixed-Precision Quantization for LLMs via Multi-Metric Sensitivity Analysis (Black Sheep AI Research, 2026)