---
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-30B-A3B Selective Quantization for Security & Code Analysis

A selectively quantized version of Qwen3-30B-A3B optimized for local inference on Apple Silicon, with preserved precision on code analysis experts.
## Model Description

This model applies **heterogeneous quantization** to Qwen3-30B-A3B's Mixture-of-Experts architecture:

- **9 coding experts → Q8** (8-bit precision)
- **119 general experts → Q4** (4-bit precision)
- **Router, attention, embeddings → FP16** (full precision)

The result: 18 GB on disk, running at **52 tokens/sec on a MacBook Pro M4** (32 GB), with materially better code analysis quality than uniform Q4 (not yet formally benchmarked; see Limitations).
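Why does per-expert precision matter? In a Mixture-of-Experts layer the router scores every expert but evaluates only a few per token, so a small set of consistently selected experts can carry most of the quality on a given domain. The toy sketch below is illustrative only (not this model's actual implementation; the `top_k` value is an assumption) and just shows that routing step.

```python
import numpy as np

def route_token(router_logits: np.ndarray, top_k: int = 8):
    """Toy MoE routing for a single token.

    router_logits: one score per expert.
    Returns the indices of the selected experts and their normalized weights.
    """
    selected = np.argsort(router_logits)[-top_k:]   # top-k experts by router score
    weights = np.exp(router_logits[selected])
    weights /= weights.sum()                        # normalize over the selected experts only
    return selected, weights
```

If coding prompts keep routing to the same few experts, quantization error in those experts dominates code-analysis quality, which is the motivation for the Q8/Q4 split above.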
## Why Selective Quantization?

Standard quantization tools (`mlx_lm.convert`, `llama.cpp`) compress all experts uniformly. For security and coding workloads, this degrades the exact experts you need most.

This model was profiled on 200 coding/security prompts to identify which experts activate for:
- Infrastructure as Code (Terraform, Kubernetes)
- Code review and vulnerability analysis
- Security alert triage

Those 9 experts are kept at Q8 while the rest are compressed to Q4.
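As a rough illustration of that profiling step (a minimal sketch with made-up function and variable names, not the released scripts): run both prompt sets through the base model, count how often the router selects each expert, and keep the experts whose activations are dominated by the coding/security set.

```python
from collections import Counter

def coding_ratio(coding_hits: Counter, general_hits: Counter, expert_id: int) -> float:
    """Fraction of an expert's activations that came from coding/security prompts."""
    c, g = coding_hits[expert_id], general_hits[expert_id]
    return c / (c + g) if (c + g) else 0.0

def select_q8_experts(coding_hits: Counter, general_hits: Counter,
                      num_experts: int = 128, threshold: float = 0.65) -> list[int]:
    """Experts to keep at Q8: those routed mostly by coding/security prompts."""
    return [e for e in range(num_experts)
            if coding_ratio(coding_hits, general_hits, e) > threshold]
```

The 0.65 threshold is the coding-ratio cutoff reported in the Quantization Details section below.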
## Target Use Cases

**Local inference for sensitive workloads:**

- Infrastructure as Code review without cloud API calls
- Code security analysis in air-gapped environments
- SIEM alert triage with internal IP/hostname context
- Pull request review for injection vulnerabilities

Zero data exfiltration. Zero tokens leaving your device.
## Hardware Requirements

- **MacBook Pro M4** with 32 GB unified memory (tested)
- **MacBook Pro M3/M2** with 32 GB+ unified memory should also work
- **Peak memory:** ~18 GB during inference
- **Inference speed:** 52 tokens/sec (M4 Pro)
## Usage

```bash
# Install MLX
pip install mlx mlx-lm

# Run inference
python -m mlx_lm.generate \
  --model PKSGIN/qwen3-30b-selective-quant \
  --prompt "Review this Terraform for security issues: ..."
```
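The same model can also be driven from Python through the mlx-lm API. A short sketch: the prompt and `max_tokens` are placeholders, and argument details may shift between mlx-lm releases.

```python
# Load the quantized model and run a single generation with mlx-lm's Python API.
from mlx_lm import load, generate

model, tokenizer = load("PKSGIN/qwen3-30b-selective-quant")
prompt = "Review this Terraform for security issues: ..."
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```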
## Performance

| Metric | Value |
|--------|-------|
| Model size on disk | 18 GB |
| Peak memory | 18.1 GB |
| Generation speed (M4 Pro) | 52 tokens/sec |
| Prompt processing | 4.4 tokens/sec |
| Compression ratio | 3.0x vs FP16 |
## Quantization Details

**Profiled experts (coding ratio > 0.65):**

```
[21, 27, 31, 43, 59, 66, 71, 113, 126]
```
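In predicate form, the split described in this card looks roughly like the sketch below. This is a hand-written illustration of the decision rule; the path substrings are assumptions about the checkpoint's weight naming, and the actual pipeline is a custom set of scripts rather than this exact function.

```python
Q8_EXPERTS = {21, 27, 31, 43, 59, 66, 71, 113, 126}  # the profiled coding experts above

def bits_for(weight_path: str) -> int | None:
    """Pick a precision for one weight tensor; None means keep FP16."""
    if any(key in weight_path for key in ("router", "embed", "attn")):
        return None          # router, embeddings, attention stay FP16
    if any(f"experts.{e}." in weight_path for e in Q8_EXPERTS):
        return 8             # preserved coding experts
    return 4                 # all remaining experts
```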
**Memory breakdown:**
- FP16 components (router/embeddings): 1.2 GB
- Q8 coding experts: 2.1 GB
- Q4 other experts: 16.7 GB
- **Total:** 20 GB of weights → 18 GB after MLX conversion
## Limitations

- **Apple Silicon only:** requires the MLX framework
- **English-centric profiling:** coding experts identified from English prompts
- **Domain-specific:** optimized for code/security, may underperform on creative writing
- **Not benchmarked:** no formal comparison vs uniform Q4 on CyberSecEval/HumanEval yet
## Training & Profiling

1. **Base model:** [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
2. **Profiling:** 200 coding/security prompts + 200 general prompts on an A100 80GB
3. **Quantization:** Selective Q8/Q4 applied via a custom pipeline
4. **Conversion:** Dequantized → MLX `switch_mlp` format → final Q4

Full pipeline: 4 scripts, ~$2 of GPU rental, 72 minutes total.
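For reference, the final conversion step (step 4) can be expressed with the standard mlx-lm Python API. This is only a sketch under the assumption that the dequantized checkpoint sits in a local directory; `./qwen3-30b-dequantized` is a placeholder path, not a published repo, and the released scripts may wrap this step differently.

```python
# Convert a local, already selectively-requantized-then-dequantized HF checkpoint
# to MLX format with a final uniform Q4 pass.
from mlx_lm import convert

convert(
    hf_path="./qwen3-30b-dequantized",        # placeholder local path
    mlx_path="./qwen3-30b-selective-quant",   # output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```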
## Citation

```bibtex
@misc{qwen3-selective-quant-2026,
  author    = {Prasanna Kanagasabai},
  title     = {Qwen3-30B-A3B Selective Quantization for Security Workloads},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/PKSGIN/qwen3-30b-selective-quant}
}
```
## License

Same as the base model: [Qwen License](https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE)

## Contact

Built by PK ([@PKSGIN](https://huggingface.co/PKSGIN)), CISO, Singapore.

Email: prasanna.in@gmail.com

---