---
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-30B-A3B Selective Quantization for Security & Code Analysis

A selectively quantized version of Qwen3-30B-A3B optimized for local inference on Apple Silicon, with preserved precision on code analysis experts.
## Model Description

This model applies **heterogeneous quantization** to Qwen3-30B-A3B's Mixture-of-Experts architecture:

- **9 coding experts → Q8** (8-bit precision)
- **119 general experts → Q4** (4-bit precision)
- **Router, attention, embeddings → FP16** (full precision)

The result: 18 GB on disk, running at **52 tokens/sec on a MacBook Pro M4** (32 GB), with materially better code analysis quality than uniform Q4 (not yet formally benchmarked; see Limitations).
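Why does per-expert precision matter? In a Mixture-of-Experts layer the router scores every expert but evaluates only a few per token, so a small set of consistently selected experts can carry most of the quality on a given domain. The toy sketch below is illustrative only (not this model's actual implementation; the `top_k` value is an assumption) and just shows that routing step.

```python
import numpy as np

def route_token(router_logits: np.ndarray, top_k: int = 8):
    """Toy MoE routing for a single token.

    router_logits: one score per expert.
    Returns the indices of the selected experts and their normalized weights.
    """
    selected = np.argsort(router_logits)[-top_k:]   # top-k experts by router score
    weights = np.exp(router_logits[selected])
    weights /= weights.sum()                        # normalize over the selected experts only
    return selected, weights
```

If coding prompts keep routing to the same few experts, quantization error in those experts dominates code-analysis quality, which is the motivation for the Q8/Q4 split above.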
## Why Selective Quantization?

Standard quantization tools (`mlx_lm.convert`, `llama.cpp`) compress all experts uniformly. For security and coding workloads, this degrades the exact experts you need most.

This model was profiled on 200 coding/security prompts to identify which experts activate for:
- Infrastructure as Code (Terraform, Kubernetes)
- Code review and vulnerability analysis
- Security alert triage

Those 9 experts are kept at Q8 while the rest are compressed to Q4.
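As a rough illustration of that profiling step (a minimal sketch with made-up function and variable names, not the released scripts): run both prompt sets through the base model, count how often the router selects each expert, and keep the experts whose activations are dominated by the coding/security set.

```python
from collections import Counter

def coding_ratio(coding_hits: Counter, general_hits: Counter, expert_id: int) -> float:
    """Fraction of an expert's activations that came from coding/security prompts."""
    c, g = coding_hits[expert_id], general_hits[expert_id]
    return c / (c + g) if (c + g) else 0.0

def select_q8_experts(coding_hits: Counter, general_hits: Counter,
                      num_experts: int = 128, threshold: float = 0.65) -> list[int]:
    """Experts to keep at Q8: those routed mostly by coding/security prompts."""
    return [e for e in range(num_experts)
            if coding_ratio(coding_hits, general_hits, e) > threshold]
```

The 0.65 threshold is the coding-ratio cutoff reported in the Quantization Details section below.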
## Target Use Cases

**Local inference for sensitive workloads:**

- Infrastructure as Code review without cloud API calls
- Code security analysis in air-gapped environments
- SIEM alert triage with internal IP/hostname context
- Pull request review for injection vulnerabilities

Zero data exfiltration. Zero tokens leaving your device.
## Hardware Requirements

- **MacBook Pro M4** with 32 GB unified memory (tested)
- **MacBook Pro M3/M2** with 32 GB+ unified memory should also work
- **Peak memory:** ~18 GB during inference
- **Inference speed:** 52 tokens/sec (M4 Pro)
## Usage

```bash
# Install MLX
pip install mlx mlx-lm

# Run inference
python -m mlx_lm.generate \
  --model PKSGIN/qwen3-30b-selective-quant \
  --prompt "Review this Terraform for security issues: ..."
```
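The same model can also be driven from Python through the mlx-lm API. A short sketch: the prompt and `max_tokens` are placeholders, and argument details may shift between mlx-lm releases.

```python
# Load the quantized model and run a single generation with mlx-lm's Python API.
from mlx_lm import load, generate

model, tokenizer = load("PKSGIN/qwen3-30b-selective-quant")
prompt = "Review this Terraform for security issues: ..."
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```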
## Performance

| Metric | Value |
|--------|-------|
| Model size on disk | 18 GB |
| Peak memory | 18.1 GB |
| Generation speed (M4 Pro) | 52 tokens/sec |
| Prompt processing | 4.4 tokens/sec |
| Compression ratio | 3.0x vs FP16 |
## Quantization Details

**Profiled experts (coding ratio > 0.65):**

```
[21, 27, 31, 43, 59, 66, 71, 113, 126]
```
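In predicate form, the split described in this card looks roughly like the sketch below. This is a hand-written illustration of the decision rule; the path substrings are assumptions about the checkpoint's weight naming, and the actual pipeline is a custom set of scripts rather than this exact function.

```python
Q8_EXPERTS = {21, 27, 31, 43, 59, 66, 71, 113, 126}  # the profiled coding experts above

def bits_for(weight_path: str) -> int | None:
    """Pick a precision for one weight tensor; None means keep FP16."""
    if any(key in weight_path for key in ("router", "embed", "attn")):
        return None          # router, embeddings, attention stay FP16
    if any(f"experts.{e}." in weight_path for e in Q8_EXPERTS):
        return 8             # preserved coding experts
    return 4                 # all remaining experts
```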
**Memory breakdown:**
- FP16 components (router/embeddings): 1.2 GB
- Q8 coding experts: 2.1 GB
- Q4 other experts: 16.7 GB
- **Total:** 20 GB of weights → 18 GB after MLX conversion
## Limitations

- **Apple Silicon only:** requires the MLX framework
- **English-centric profiling:** coding experts identified from English prompts
- **Domain-specific:** optimized for code/security, may underperform on creative writing
- **Not benchmarked:** no formal comparison vs uniform Q4 on CyberSecEval/HumanEval yet
## Training & Profiling

1. **Base model:** [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
2. **Profiling:** 200 coding/security prompts + 200 general prompts on an A100 80GB
3. **Quantization:** Selective Q8/Q4 applied via a custom pipeline
4. **Conversion:** Dequantized → MLX `switch_mlp` format → final Q4

Full pipeline: 4 scripts, ~$2 of GPU rental, 72 minutes total.
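For reference, the final conversion step (step 4) can be expressed with the standard mlx-lm Python API. This is only a sketch under the assumption that the dequantized checkpoint sits in a local directory; `./qwen3-30b-dequantized` is a placeholder path, not a published repo, and the released scripts may wrap this step differently.

```python
# Convert a local, already selectively-requantized-then-dequantized HF checkpoint
# to MLX format with a final uniform Q4 pass.
from mlx_lm import convert

convert(
    hf_path="./qwen3-30b-dequantized",        # placeholder local path
    mlx_path="./qwen3-30b-selective-quant",   # output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```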
## Citation

```bibtex
@misc{qwen3-selective-quant-2026,
  author    = {Prasanna Kanagasabai},
  title     = {Qwen3-30B-A3B Selective Quantization for Security Workloads},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/PKSGIN/qwen3-30b-selective-quant}
}
```
## License

Same as the base model: [Qwen License](https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE)

## Contact

Built by PK ([@PKSGIN](https://huggingface.co/PKSGIN)), CISO, Singapore.

Email: prasanna.in@gmail.com

---