---
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-30B-A3B Selective Quantization for Security & Code Analysis

A selectively quantized version of Qwen3-30B-A3B optimized for local inference on Apple Silicon, with preserved precision on code analysis experts.
9
+
10
+ ## Model Description
11
+
12
+ This model applies **heterogeneous quantization** to Qwen3-30B-A3B's Mixture-of-Experts architecture:
13
+
14
+ - **9 coding experts β†’ Q8** (8-bit precision)
15
+ - **119 general experts β†’ Q4** (4-bit precision)
16
+ - **Router, attention, embeddings β†’ FP16** (full precision)
17
+
18
+ The result: 18 GB on disk, runs at **52 tokens/sec on MacBook Pro M4** (32GB), with materially better code analysis quality than uniform Q4.
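
In spirit, the scheme above amounts to a per-tensor bit-assignment policy. A minimal sketch, assuming a checkpoint path convention like `...mlp.experts.{e}...` (the path names here are illustrative, not the actual pipeline's):

```python
import re

# Illustrative expert indices from the profiling step described below;
# the path naming convention is an assumption about the checkpoint
# layout, not the actual conversion pipeline.
CODING_EXPERTS = {21, 27, 31, 43, 59, 66, 71, 113, 126}

def bits_for(path: str):
    """Quantization bits for one weight tensor: 8 for profiled coding
    experts, 4 for other experts, None (keep FP16) for router,
    attention, and embeddings."""
    m = re.search(r"experts\.(\d+)\.", path)
    if m:  # expert weight: Q8 if profiled as coding-heavy, else Q4
        return 8 if int(m.group(1)) in CODING_EXPERTS else 4
    if any(k in path for k in ("gate", "attn", "embed", "lm_head")):
        return None  # router/attention/embeddings stay at full precision
    return 4  # any remaining dense weight defaults to Q4
```

A predicate like this is the shape of input that mixed-precision converters (e.g. `mlx_lm.convert`'s quantization predicate hook) expect: a function from weight path to precision.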

## Why Selective Quantization?

Standard quantization tools (`mlx_lm.convert`, `llama.cpp`) compress all experts uniformly. For security and coding workloads, this degrades exactly the experts you need most.

This model was profiled on 200 coding/security prompts to identify which experts activate for:

- Infrastructure as Code (Terraform, Kubernetes)
- Code review and vulnerability analysis
- Security alert triage

Those 9 experts are preserved at Q8 while the rest are compressed to Q4.
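
Conceptually, the profiling step reduces to counting how often the router selects each expert under coding prompts versus general prompts, then thresholding the ratio. A toy sketch with synthetic counts (the function name and data are illustrative, not the actual profiling code):

```python
from collections import Counter

def coding_ratios(coding_hits, general_hits):
    """Per-expert fraction of router activations that came from
    coding/security prompts rather than general prompts."""
    ratios = {}
    for expert in set(coding_hits) | set(general_hits):
        c, g = coding_hits[expert], general_hits[expert]
        ratios[expert] = c / (c + g) if (c + g) else 0.0
    return ratios

# Synthetic counts: expert 21 is routed to mostly on coding prompts,
# expert 5 mostly on general prompts.
coding = Counter({21: 90, 5: 10})
general = Counter({21: 30, 5: 70})
ratios = coding_ratios(coding, general)
# Experts above the 0.65 coding-ratio threshold get preserved at Q8.
q8_experts = sorted(e for e, r in ratios.items() if r > 0.65)
```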

## Target Use Cases

**Local inference for sensitive workloads:**

- Infrastructure as Code review without cloud API calls
- Code security analysis in air-gapped environments
- SIEM alert triage with internal IP/hostname context
- Pull request review for injection vulnerabilities

Zero data exfiltration; zero tokens leave your device.

## Hardware Requirements

- **MacBook Pro M4** with 32GB unified memory (tested)
- **MacBook Pro M3/M2** with 32GB+ unified memory (should also work)
- **Peak memory:** ~18 GB during inference
- **Inference speed:** 52 tokens/sec (M4 Pro)

## Usage

```bash
# Install MLX and the MLX LM toolkit
pip install mlx mlx-lm

# Run inference
python -m mlx_lm.generate \
    --model PKSGIN/qwen3-30b-selective-quant \
    --prompt "Review this Terraform for security issues: ..."
```

## Performance

| Metric | Value |
|--------|-------|
| Model size on disk | 18 GB |
| Peak memory | 18.1 GB |
| Generation speed (M4 Pro) | 52 tokens/sec |
| Prompt processing | 4.4 tokens/sec |
| Compression ratio | 3.0x vs FP16 |

## Quantization Details

**Profiled coding experts (coding ratio > 0.65):**

```
[21, 27, 31, 43, 59, 66, 71, 113, 126]
```

**Memory breakdown:**

- FP16 components (router/embeddings): 1.2 GB
- Q8 coding experts: 2.1 GB
- Q4 other experts: 16.7 GB
- **Total:** 20 GB of weights → 18 GB after MLX conversion
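
The 3.0x compression ratio follows directly from the FP16 baseline: roughly 30B parameters at 2 bytes each is about 60 GB, against ~20 GB of mixed-precision weights. A back-of-envelope sketch (nominal sizes, ignoring quantization scale/zero-point overhead):

```python
# Rough compression-ratio arithmetic for the mixed Q8/Q4/FP16 scheme.
# Real quantized formats carry scale/zero-point metadata per group,
# so these are approximations, not exact on-disk sizes.
PARAMS_TOTAL = 30e9                # approx. total parameters
fp16_gb = PARAMS_TOTAL * 2 / 1e9   # FP16 baseline: ~60 GB
mixed_gb = 1.2 + 2.1 + 16.7        # breakdown reported above: ~20 GB
ratio = fp16_gb / mixed_gb         # ~3.0x, matching the table
```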

## Limitations

- **Apple Silicon only**: requires the MLX framework
- **English-centric profiling**: coding experts were identified from English prompts only
- **Domain-specific**: optimized for code/security; may underperform on creative writing
- **Not yet benchmarked**: no formal comparison against uniform Q4 on CyberSecEval/HumanEval

## Training & Profiling

1. **Base model:** [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
2. **Profiling:** 200 coding/security prompts + 200 general prompts, run on an A100 80GB
3. **Quantization:** selective Q8/Q4 applied via a custom pipeline
4. **Conversion:** dequantized → MLX `switch_mlp` format → final Q4

Full pipeline: 4 scripts, ~$2 in GPU rental, 72 minutes total.

## Citation

```bibtex
@misc{qwen3-selective-quant-2026,
  author    = {Prasanna Kanagasabai},
  title     = {Qwen3-30B-A3B Selective Quantization for Security Workloads},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/PKSGIN/qwen3-30b-selective-quant}
}
```

## License

Same as the base model: [Qwen License](https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE)

## Contact

Built by PK ([@PKSGIN](https://huggingface.co/PKSGIN)), CISO, Singapore.

Email: prasanna.in@gmail.com