GLM-4.7-Flash: SWAN Mixed-Precision (4-bit avg)

This is GLM-4.7-Flash (MoE, 30B parameters) quantized using SWAN (Statistical Weight Analysis for N-bit allocation), a data-free per-tensor mixed-precision quantization method for MLX on Apple Silicon.

Key Features

  • Data-free quantization: No calibration dataset required; bit widths are chosen from weight statistics alone
  • Per-tensor bit allocation: Each tensor is assigned 2, 4, 8, or 16 bits based on a sensitivity analysis
  • MoE-aware: Adaptive normalization preserves precision in expert layers
  • MLX native: Ready for inference on Apple Silicon via mlx_lm
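As a rough sketch of how per-tensor bit allocation might work: a composite sensitivity score per tensor is mapped to one of the supported bit widths. The threshold values and tensor names below are illustrative only; SWAN's actual cutoffs are not published in this card.

```python
# Illustrative score-to-bit-width mapping (thresholds are hypothetical,
# not SWAN's actual values).
def allocate_bits(sensitivity: float) -> int:
    """Map a composite sensitivity score in [0, 1] to a bit width."""
    if sensitivity < 0.25:
        return 2   # highly robust tensor: aggressive quantization
    if sensitivity < 0.60:
        return 4   # typical tensor: base precision
    if sensitivity < 0.85:
        return 8   # sensitive tensor (e.g. shared experts, attention)
    return 16      # most sensitive tensors kept near full precision

# Hypothetical tensor names, for illustration only.
scores = {"expert.ffn": 0.30, "attn.q_proj": 0.70, "shared_expert": 0.90}
plan = {name: allocate_bits(s) for name, s in scores.items()}
print(plan)  # {'expert.ffn': 4, 'attn.q_proj': 8, 'shared_expert': 16}
```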

Results

| Metric                  | BF16  | SWAN (this model) | Uniform 4-bit |
|-------------------------|-------|-------------------|---------------|
| PPL median (WikiText-2) | 8.61  | 9.08 (+5.5%)      | 11.46 (+33%)  |
| Model size              | 56 GB | 15.9 GB           | 14.5 GB       |

SWAN significantly outperforms uniform 4-bit on this MoE model: median PPL is only 5.5% above BF16, versus 33% for uniform 4-bit.
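A quick back-of-the-envelope check that the 15.9 GB figure is consistent with the "4-bit avg" label (treating GB as 10^9 bytes and ignoring embedding and metadata overhead):

```python
# Average bits per parameter implied by the model size above.
params = 30e9          # 30B parameters
size_bytes = 15.9e9    # 15.9 GB on disk
bits_per_param = size_bytes * 8 / params
print(round(bits_per_param, 2))  # 4.24 bits per parameter on average
```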

Usage

pip install mlx-lm

# Generate text
python -m mlx_lm.generate \
    --model baa-ai/GLM-4.7-Flash-SWAN-4bit \
    --prompt "Hello, how are you?"

# Interactive chat
python -m mlx_lm.chat --model baa-ai/GLM-4.7-Flash-SWAN-4bit
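The model can also be loaded from Python via the mlx_lm API. A minimal sketch (requires Apple Silicon with mlx installed; exact keyword arguments for generate vary across mlx_lm versions):

```python
# Minimal mlx_lm Python API sketch; run on Apple Silicon with mlx installed.
from mlx_lm import load, generate

model, tokenizer = load("baa-ai/GLM-4.7-Flash-SWAN-4bit")
text = generate(model, tokenizer, prompt="Hello, how are you?", verbose=True)
print(text)
```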

Quantization Details

  • Method: SWAN v3 (hybrid normalization: adaptive with selective fixed fallback)
  • Base precision: 4-bit with selective 8-bit for shared expert layers and attention projections
  • Architecture: Mixture-of-Experts with DeepSeekMLA attention
  • Hardware: Quantized on Apple M2 Ultra 192GB

About SWAN

SWAN computes four sensitivity metrics per tensor: SVD spectral concentration, excess kurtosis, output noise amplification, and a reconstruction-error proxy. These are combined into a composite score that drives automatic bit-width allocation, without any calibration data.
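One of the four metrics, excess kurtosis, can be sketched in plain Python: heavy-tailed weight distributions (outlier-prone tensors) score higher and are harder to quantize at low bit widths. The composite weighting below is made up for illustration; SWAN's actual combination of the four metrics is not published in this card.

```python
import statistics

def excess_kurtosis(weights):
    """Fourth standardized moment minus 3 (0 for a Gaussian).
    Higher values indicate heavy tails / outliers in the weights."""
    mu = statistics.fmean(weights)
    var = statistics.fmean((w - mu) ** 2 for w in weights)
    m4 = statistics.fmean((w - mu) ** 4 for w in weights)
    return m4 / (var ** 2) - 3.0

def composite_score(kurt, svd_conc, noise_amp, recon_err):
    """Hypothetical equal-weight combination of the four metrics,
    each assumed normalized to [0, 1]. Not SWAN's actual formula."""
    return 0.25 * min(kurt / 10.0, 1.0) + 0.25 * svd_conc \
         + 0.25 * noise_amp + 0.25 * recon_err

# A uniform sample is light-tailed: excess kurtosis is about -1.2.
uniform = [i / 100 for i in range(-100, 101)]
print(round(excess_kurtosis(uniform), 2))  # -1.2
```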

  • Paper: SWAN: Data-Free Mixed-Precision Quantization for LLMs via Multi-Metric Sensitivity Analysis (Black Sheep AI Research, 2026)