# GLM-4.7-Flash: SWAN Mixed-Precision (4-bit avg)
This is GLM-4.7-Flash (MoE, 30B parameters) quantized using SWAN (Statistical Weight Analysis for N-bit allocation), a data-free per-tensor mixed-precision quantization method for MLX on Apple Silicon.
## Key Features
- Data-free quantization: no calibration dataset required; uses weight statistics only
- Per-tensor bit allocation: each tensor is assigned 2, 4, 8, or 16 bits based on sensitivity analysis
- MoE-aware: adaptive normalization preserves expert-layer precision
- MLX native: ready for inference on Apple Silicon via `mlx_lm`
## Results
| Metric | BF16 | SWAN (this model) | Uniform 4-bit |
|---|---|---|---|
| PPL median (WikiText-2) | 8.61 | 9.08 (+5.5%) | 11.46 (+33%) |
| Model size | 56 GB | 15.9 GB | 14.5 GB |
SWAN significantly outperforms uniform 4-bit quantization on this MoE model: median perplexity is only 5.5% above BF16, versus 33% for uniform 4-bit.
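As a sanity check on the size column, the effective average bits per weight can be computed from the reported sizes (a rough estimate that treats GB as 10^9 bytes and ignores embedding/norm tensors; the overhead above 4.0 bits comes from quantization scales and the tensors kept at higher precision):

```python
def bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average bits per parameter implied by an on-disk size."""
    return size_gb * 1e9 * 8 / n_params

N = 30e9  # 30B parameters
print(round(bits_per_weight(15.9, N), 2))  # SWAN: 4.24 bits/weight
print(round(bits_per_weight(14.5, N), 2))  # uniform 4-bit: 3.87 bits/weight
```

So the "4-bit avg" label corresponds to roughly 4.2 effective bits per weight for the mixed-precision model.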
## Usage
```shell
pip install mlx-lm

# Generate text
python -m mlx_lm.generate \
  --model baa-ai/GLM-4.7-Flash-SWAN-4bit \
  --prompt "Hello, how are you?"

# Interactive chat
python -m mlx_lm.chat --model baa-ai/GLM-4.7-Flash-SWAN-4bit
```
## Quantization Details
- Method: SWAN v3 (hybrid normalization: adaptive with selective fixed fallback)
- Base precision: 4-bit with selective 8-bit for shared expert layers and attention projections
- Architecture: Mixture-of-Experts with DeepSeek-style MLA (multi-head latent attention)
- Hardware: Quantized on Apple M2 Ultra 192GB
## About SWAN
SWAN computes four sensitivity metrics per tensor: SVD spectral concentration, excess kurtosis, output noise amplification, and reconstruction error proxy. These are combined into a composite score that drives automatic bit-width allocation, without any calibration data.
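The metric-to-bits mapping can be sketched as follows. This is an illustrative toy, not the released implementation: the weightings and thresholds are invented for the example, and only two of the four metrics (spectral concentration and excess kurtosis) are shown.

```python
import numpy as np

def excess_kurtosis(w: np.ndarray) -> float:
    # Heavy-tailed weight distributions quantize poorly -> more bits.
    z = (w.ravel() - w.mean()) / w.std()
    return float((z ** 4).mean() - 3.0)

def spectral_concentration(w: np.ndarray, k: int = 8) -> float:
    # Fraction of singular-value mass in the top-k directions.
    s = np.linalg.svd(w, compute_uv=False)
    return float(s[:k].sum() / s.sum())

def allocate_bits(w: np.ndarray) -> int:
    # Toy composite score; real SWAN combines four metrics with tuned weights.
    score = 0.5 * spectral_concentration(w) + 0.5 * min(excess_kurtosis(w) / 10.0, 1.0)
    if score > 0.6:
        return 8
    if score > 0.15:
        return 4
    return 2

rng = np.random.default_rng(0)
dense = rng.standard_normal((256, 256))        # featureless spectrum -> few bits
low_rank = np.outer(rng.standard_normal(256),  # concentrated spectrum -> more bits
                    rng.standard_normal(256))
print(allocate_bits(dense), allocate_bits(low_rank))
```

The point of the composite score is that no forward passes are needed: every input is a statistic of the weight tensor itself, which is what makes the method data-free.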
- Paper: SWAN: Data-Free Mixed-Precision Quantization for LLMs via Multi-Metric Sensitivity Analysis (Black Sheep AI Research, 2026)