Qwen3-30B-A3B Selective Quantization for Security & Code Analysis
A selectively quantized version of Qwen3-30B-A3B optimized for local inference on Apple Silicon, with preserved precision on code analysis experts.
MLX model: PKSGIN/qwen3-30b-selective-quant-mlx
Model Description
This model applies heterogeneous quantization to Qwen3-30B-A3B's Mixture-of-Experts architecture:
- 9 coding experts → Q8 (8-bit precision)
- 119 general experts → Q4 (4-bit precision)
- Router, attention, embeddings → FP16 (full precision)
The result: an 18 GB model that runs at 52 tokens/sec on a MacBook Pro M4 (32 GB), with materially better code-analysis quality than uniform Q4.
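As an illustration of the split, recent versions of mlx_lm expose a quant_predicate hook for mixed-precision conversion. The sketch below is not the pipeline that produced this model (that is described under Training & Profiling): the module-path patterns are assumptions about how MLX names Qwen3's MoE parameters, and MLX's fused switch_mlp layout may not expose per-expert modules at all, which is one reason a custom pipeline was needed.

```python
import re

from mlx_lm import convert

# Experts to keep at Q8 (the profiled list from Quantization Details).
CODING_EXPERTS = {21, 27, 31, 43, 59, 66, 71, 113, 126}

def quant_predicate(path, module, config):
    # Keep the router, attention, and embeddings unquantized.
    # NOTE: these name checks are assumptions about the MLX parameter
    # tree for Qwen3-30B-A3B; verify against the loaded model.
    if path.endswith(".gate") or any(
        key in path for key in ("self_attn", "embed_tokens", "lm_head")
    ):
        return False
    # Per-expert split: Q8 for profiled coding experts, Q4 for the rest.
    match = re.search(r"experts\.(\d+)\.", path)
    if match and int(match.group(1)) in CODING_EXPERTS:
        return {"bits": 8, "group_size": 64}
    return {"bits": 4, "group_size": 64}

convert(
    "Qwen/Qwen3-30B-A3B",
    mlx_path="qwen3-30b-selective-quant-mlx",
    quantize=True,
    quant_predicate=quant_predicate,
)
```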
Why Selective Quantization?
Standard quantization tools (mlx_lm.convert, llama.cpp) compress all experts uniformly. For security and coding workloads, this degrades the exact experts you need most.
This model was profiled on 200 coding/security prompts to identify which experts activate for:
- Infrastructure as Code (Terraform, Kubernetes)
- Code review and vulnerability analysis
- Security alert triage
Those 9 experts are preserved at Q8 while compressing the rest to Q4.
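A sketch of the profiling idea (not the author's exact scripts): register forward hooks on every layer's router and count how often each expert lands in the top-k selection for each prompt category. The mlp.gate module name and top-k of 8 are assumptions about the Hugging Face Qwen3-MoE implementation.

```python
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Expert activation counts per category, aggregated over layers and tokens.
counts = {"coding": Counter(), "general": Counter()}
active = {"label": "coding"}

def router_hook(_module, _inputs, output):
    # Router output: logits of shape [num_tokens, num_experts].
    logits = output[0] if isinstance(output, tuple) else output
    topk = logits.topk(8, dim=-1).indices  # 8 experts active per token
    counts[active["label"]].update(topk.flatten().tolist())

for name, module in model.named_modules():
    if name.endswith("mlp.gate"):  # assumed router name in HF Qwen3-MoE
        module.register_forward_hook(router_hook)

def profile(prompts, label):
    active["label"] = label
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)

# Replace these with the real 200-prompt sets.
profile(["Review this Terraform security group for open ingress."], "coding")
profile(["Summarize the plot of a mystery novel."], "general")
```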
Target Use Cases
Local inference for sensitive workloads:
- Infrastructure as Code review without cloud API calls
- Code security analysis in air-gapped environments
- SIEM alert triage with internal IP/hostname context
- Pull request review for injection vulnerabilities
Zero data exfiltration. Zero tokens leaving your device.
Hardware Requirements
- MacBook Pro M4 with 32 GB unified memory (tested)
- MacBook Pro M3/M2 with 32 GB+ should also work
- Peak memory: ~18 GB during inference
- Inference speed: 52 tokens/sec (M4 Pro)
Usage
```bash
# Install MLX
pip install mlx mlx-lm

# Run inference
python -m mlx_lm.generate \
  --model PKSGIN/qwen3-30b-selective-quant \
  --prompt "Review this Terraform for security issues: ..."
```
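Or from Python, using mlx_lm's load/generate API:

```python
from mlx_lm import load, generate

model, tokenizer = load("PKSGIN/qwen3-30b-selective-quant")
response = generate(
    model,
    tokenizer,
    prompt="Review this Terraform for security issues: ...",
    max_tokens=512,
)
print(response)
```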
Performance
| Metric | Value |
|---|---|
| Model size on disk | 18 GB |
| Peak memory | 18.1 GB |
| Tokens/sec (M4 Pro) | 52 |
| Prompt processing | 4.4 tokens/sec |
| Compression ratio | 3.0x vs FP16 |
Quantization Details
Profiled experts (coding ratio > 0.65):
[21, 27, 31, 43, 59, 66, 71, 113, 126]
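Continuing the profiling sketch above, the coding ratio is each expert's share of activations attributable to the coding/security prompt set; the cutoff of 0.65 is the one quoted above. (This treats expert IDs as aggregated across layers, which is an assumption about how the published list was indexed.)

```python
# Select experts whose routed tokens came predominantly from the
# coding/security prompt set, using the counters built earlier.
ratios = {}
for expert in range(128):  # Qwen3-30B-A3B routes over 128 experts
    c = counts["coding"][expert]
    g = counts["general"][expert]
    if c + g:
        ratios[expert] = c / (c + g)

q8_experts = sorted(e for e, r in ratios.items() if r > 0.65)
print(q8_experts)  # the run behind this card yielded the 9 experts above
```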
Memory breakdown:
- FP16 components (router/embeddings): 1.2 GB
- Q8 coding experts: 2.1 GB
- Q4 other experts: 16.7 GB
- Total: 20 GB weights → 18 GB after MLX conversion
Limitations
- Apple Silicon only: requires the MLX framework
- English-centric profiling: coding experts were identified from English prompts
- Domain-specific: optimized for code/security; may underperform on creative writing
- Not benchmarked: no formal comparison against uniform Q4 on CyberSecEval/HumanEval yet
Training & Profiling
- Base model: Qwen/Qwen3-30B-A3B
- Profiling: 200 coding/security prompts + 200 general prompts on A100 80GB
- Quantization: Selective Q8/Q4 applied via custom pipeline
- Conversion: dequantized → MLX switch_mlp format → final Q4 (see the sketch below)
Full pipeline: 4 scripts, ~$2 GPU rental, 72 minutes total.
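The final conversion step corresponds roughly to a standard mlx_lm.convert invocation (paths are illustrative; the input is the dequantized checkpoint carrying the selective Q8/Q4 rounding):

```bash
python -m mlx_lm.convert \
  --hf-path ./qwen3-30b-selective-dequantized \
  --mlx-path ./qwen3-30b-selective-quant-mlx \
  -q --q-bits 4 --q-group-size 64
```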
Citation
```bibtex
@misc{qwen3-selective-quant-2026,
  author    = {Prasanna Kanagasabai},
  title     = {Qwen3-30B-A3B Selective Quantization for Security Workloads},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/PKSGIN/qwen3-30b-selective-quant}
}
```
License
Same as base model: Qwen License
Contact
Built by PK (@PKSGIN), CISO, Singapore. Email: prasanna.in@gmail.com