Qwen3-30B-A3B Selective Quantization for Security & Code Analysis

A selectively quantized version of Qwen3-30B-A3B optimized for local inference on Apple Silicon, with preserved precision on code analysis experts.

MLX model: PKSGIN/qwen3-30b-selective-quant-mlx

Model Description

This model applies heterogeneous quantization to Qwen3-30B-A3B's Mixture-of-Experts architecture:

  • 9 coding experts → Q8 (8-bit precision)
  • 119 general experts → Q4 (4-bit precision)
  • Router, attention, embeddings → FP16 (full precision)

The result is an 18 GB model on disk that runs at 52 tokens/sec on a MacBook Pro M4 (32 GB), with materially better code-analysis quality than uniform Q4.
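
The three-tier split can be sketched as a small routing function: given a tensor's name, pick its precision tier. This is illustrative only, not the actual pipeline code; the tensor-name patterns and the helper name `precision_for` are assumptions, while the expert IDs are the nine profiled coding experts listed under Quantization Details.

```python
# Illustrative sketch of the three-tier precision split (assumed naming scheme).
CODING_EXPERTS = {21, 27, 31, 43, 59, 66, 71, 113, 126}  # profiled coding experts

def precision_for(tensor_name: str) -> str:
    """Map a tensor name to its target precision tier."""
    if "experts." in tensor_name:
        # e.g. "model.layers.3.mlp.experts.21.down_proj" (hypothetical naming)
        expert_id = int(tensor_name.split("experts.")[1].split(".")[0])
        return "q8" if expert_id in CODING_EXPERTS else "q4"
    return "fp16"  # router, attention, embeddings keep full precision
```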

Why Selective Quantization?

Standard quantization tools (mlx_lm.convert, llama.cpp) compress all experts uniformly. For security and coding workloads, this degrades the exact experts you need most.

This model was profiled on 200 coding/security prompts to identify which experts activate for:

  • Infrastructure as Code (Terraform, Kubernetes)
  • Code review and vulnerability analysis
  • Security alert triage

Those 9 experts are preserved at Q8 while compressing the rest to Q4.
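
The selection step above can be sketched as follows. This is a plausible reconstruction, not the actual profiling code: the exact definition of "coding ratio" (an expert's activation count on coding/security prompts divided by its total activation count across both prompt sets) is an assumption.

```python
from collections import Counter

def select_coding_experts(coding_acts, general_acts, threshold=0.65):
    """Given router activation logs (one expert ID per routed token),
    return experts activated predominantly on coding prompts."""
    coding, general = Counter(coding_acts), Counter(general_acts)
    selected = [
        e for e in coding
        if coding[e] / (coding[e] + general[e]) > threshold
    ]
    return sorted(selected)
```

On the real profile, the nine experts clearing the 0.65 threshold are the ones listed under Quantization Details.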

Target Use Cases

Local inference for sensitive workloads:

  • Infrastructure as Code review without cloud API calls
  • Code security analysis in air-gapped environments
  • SIEM alert triage with internal IP/hostname context
  • Pull request review for injection vulnerabilities

Zero data exfiltration. Zero tokens leaving your device.

Hardware Requirements

  • MacBook Pro M4 with 32GB unified memory (tested)
  • MacBook Pro M2/M3 with 32 GB+ unified memory should also work (untested)
  • Peak memory: ~18 GB during inference
  • Inference speed: 52 tokens/sec (M4 Pro)

Usage

# Install MLX
pip install mlx mlx-lm

# Run inference
python -m mlx_lm.generate \
    --model PKSGIN/qwen3-30b-selective-quant \
    --prompt "Review this Terraform for security issues: ..."

Performance

Metric                     Value
Model size on disk         18 GB
Peak memory                18.1 GB
Tokens/sec (M4 Pro)        52
Prompt processing          4.4 tokens/sec
Compression ratio          3.0x vs FP16

Quantization Details

Profiled experts (coding ratio > 0.65):

[21, 27, 31, 43, 59, 66, 71, 113, 126]

Memory breakdown:

  • FP16 components (router/embeddings): 1.2 GB
  • Q8 coding experts: 2.1 GB
  • Q4 other experts: 16.7 GB
  • Total: 20 GB weights → 18 GB after MLX conversion
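
As a sanity check on the numbers above, a minimal sketch, assuming roughly 30.5B total parameters for Qwen3-30B-A3B at 2 bytes per parameter in FP16:

```python
# Sanity-check the memory breakdown above. The ~30.5B total parameter
# count is an assumption based on the model family name.
fp16_gb, q8_gb, q4_gb = 1.2, 2.1, 16.7
mixed_total = fp16_gb + q8_gb + q4_gb   # ~20 GB of pre-conversion weights
fp16_total = 30.5 * 2                   # ~61 GB at 2 bytes per parameter
print(round(mixed_total, 1), round(fp16_total / mixed_total, 2))
```

which is consistent with the ~3.0x compression ratio reported under Performance.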

Limitations

  • Apple Silicon only: requires the MLX framework
  • English-centric profiling: coding experts were identified from English prompts only
  • Domain-specific: optimized for code/security; may underperform on creative writing
  • Not yet benchmarked: no formal comparison against uniform Q4 on CyberSecEval/HumanEval

Training & Profiling

  1. Base model: Qwen/Qwen3-30B-A3B
  2. Profiling: 200 coding/security prompts + 200 general prompts on A100 80GB
  3. Quantization: Selective Q8/Q4 applied via custom pipeline
  4. Conversion: Dequantized → MLX switch_mlp format → final Q4

Full pipeline: four scripts, ~$2 in GPU rental, 72 minutes total.
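
Step 3 of the pipeline can be sketched as round-trip ("fake") quantization: each expert tensor is quantized to its target bit width and immediately dequantized, so the precision loss is baked into an FP16 checkpoint that a standard MLX conversion can then consume. The sketch below is a minimal per-group symmetric scheme for illustration; the actual custom pipeline may differ.

```python
import numpy as np

def fake_quantize(w, bits, group_size=64):
    """Simulate round-trip quantization (quantize then dequantize) so the
    precision loss is baked into the weights. Assumes w.size is a
    multiple of group_size."""
    flat = w.reshape(-1, group_size)
    # Per-group symmetric scale mapping the max magnitude to the int range.
    scale = np.abs(flat).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(flat / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(w.shape)
```

Coding experts would pass through `fake_quantize(w, 8)` and the remaining experts through `fake_quantize(w, 4)` before the dequantized checkpoint is handed to the MLX converter.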

Citation

@misc{qwen3-selective-quant-2026,
  author = {Prasanna Kanagasabai},
  title = {Qwen3-30B-A3B Selective Quantization for Security Workloads},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/PKSGIN/qwen3-30b-selective-quant}
}

License

Same as base model: Qwen License

Contact

Built by PK (@PKSGIN), CISO, Singapore. Email: prasanna.in@gmail.com
