This is the final MLX model, which can run on your MacBook.
My laptop is a 2023 M2 with 32 GB of unified memory.
Qwen3-30B-A3B Selective Quantization for Security & Code Analysis
A selectively quantized version of Qwen3-30B-A3B optimized for local inference on Apple Silicon, with preserved precision on code analysis experts.
MLX models:
- PKSGIN/qwen3-30b-selective-quant-mlx (deprecated)
- PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx (new)
Model Description
This model applies heterogeneous quantization to Qwen3-30B-A3B's Mixture-of-Experts architecture:
- 9 coding experts → Q8 (8-bit precision)
- 119 general experts → Q4 (4-bit precision)
- Router, attention, embeddings → FP16 (left unquantized)
The result: 18 GB on disk, running at 52 tokens/sec on a MacBook Pro M4 (32 GB). The goal is better code analysis quality than uniform Q4, although this has not yet been formally benchmarked (see Limitations).
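The precision assignment described above can be sketched as a small lookup. This is only an illustration, not the author's pipeline: the helper `bits_for_component` and the component labels are hypothetical, and the expert indices are the ones listed under Quantization Details in this card.

```python
# Expert indices kept at Q8, as listed under "Quantization Details".
CODING_EXPERTS = frozenset({21, 27, 31, 43, 59, 66, 71, 113, 126})
NUM_EXPERTS = 128  # 9 coding experts + 119 general experts


def bits_for_component(kind, expert_id=None):
    """Return the bit width for a model component (hypothetical helper)."""
    if kind in ("router", "attention", "embeddings"):
        return 16  # left unquantized in FP16
    if kind == "expert":
        if expert_id is None or not 0 <= expert_id < NUM_EXPERTS:
            raise ValueError(f"invalid expert id: {expert_id!r}")
        return 8 if expert_id in CODING_EXPERTS else 4
    raise ValueError(f"unknown component kind: {kind!r}")


# Build the per-expert plan and count how many experts land in each bucket.
plan = {i: bits_for_component("expert", i) for i in range(NUM_EXPERTS)}
q8_count = sum(1 for bits in plan.values() if bits == 8)
q4_count = sum(1 for bits in plan.values() if bits == 4)
print(q8_count, q4_count)  # 9 119
```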
Why Selective Quantization?
Standard quantization tools (mlx_lm.convert, llama.cpp) compress all experts uniformly. For security and coding workloads, this degrades the exact experts you need most.
This model was profiled on 200 coding/security prompts to identify which experts activate for:
- Infrastructure as Code (Terraform, Kubernetes)
- Code review and vulnerability analysis
- Security alert triage
Those 9 experts are preserved at Q8 while compressing the rest to Q4.
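To illustrate why bit width matters for the experts you rely on, the sketch below applies plain symmetric round-to-nearest quantization to a handful of sample weights at 4 and 8 bits and compares the reconstruction error. This is a generic textbook scheme chosen for illustration; it is not the group-wise scheme used by MLX or by the author's pipeline, and the weights are made-up values.

```python
def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax  # one scale shared by all weights
    restored = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights, for illustration only.
weights = [0.013, -0.42, 0.27, 0.081, -0.155, 0.399, -0.007, 0.31]

err_q4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
err_q8 = mean_abs_error(weights, quantize_dequantize(weights, 8))
print(f"Q4 mean abs error: {err_q4:.5f}")
print(f"Q8 mean abs error: {err_q8:.5f}")
```

With a shared scale, each extra bit roughly halves the rounding step, so the 8-bit error is much smaller than the 4-bit error on the same weights.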
Target Use Cases
Local inference for sensitive workloads:
- Infrastructure as Code review without cloud API calls
- Code security analysis in air-gapped environments
- SIEM alert triage with internal IP/hostname context
- Pull request review for injection vulnerabilities
Zero data exfiltration. Zero tokens leaving your device.
Hardware Requirements
- MacBook Pro M4 with 32GB unified memory (tested)
- MacBook Pro M3/M2 with 32GB+ will work
- Peak memory: ~18 GB during inference
- Inference speed: 52 tokens/sec (M4 Pro)
Usage
```bash
# Install MLX
pip install mlx mlx-lm

# Run inference
python -m mlx_lm.generate \
  --model PKSGIN/qwen3-30b-selective-quant \
  --prompt "Review this Terraform for security issues: ..."
```
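The same call can be made from Python through the `mlx_lm` package's `load` and `generate` functions. The sketch below is a minimal example under a few assumptions: the repository ID is copied from the command above (substitute the MLX repository you actually use), `build_review_prompt` and `review` are hypothetical helper names, and the Terraform snippet is made up.

```python
# Repository ID copied from the CLI command above; swap in the repo you use.
MODEL_ID = "PKSGIN/qwen3-30b-selective-quant"


def build_review_prompt(terraform_source):
    """Hypothetical helper: wrap Terraform source in a review instruction."""
    return (
        "Review this Terraform for security issues:\n\n"
        + terraform_source.strip()
        + "\n"
    )


def review(terraform_source, max_tokens=512):
    """Load the model with mlx-lm and return the generated review text."""
    # Imported lazily so the prompt helper works on machines without MLX.
    from mlx_lm import load, generate

    model, tokenizer = load(MODEL_ID)
    messages = [{"role": "user", "content": build_review_prompt(terraform_source)}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


# Made-up Terraform snippet, for illustration only.
sample = 'resource "aws_s3_bucket" "logs" {\n  acl = "public-read"\n}'

# Uncomment on an Apple Silicon machine with mlx-lm installed:
# print(review(sample))
```

Everything runs on the local machine once the weights are downloaded, which matches the no-cloud-API use cases listed earlier.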
Performance
| Metric | Value |
|---|---|
| Model size on disk | 18 GB |
| Peak memory | 18.1 GB |
| Tokens/sec (M4 Pro) | 52 |
| Prompt processing | 4.4 tokens/sec |
| Compression ratio | 3.0x vs FP16 |
Quantization Details
Profiled experts (coding ratio > 0.65):
[21, 27, 31, 43, 59, 66, 71, 113, 126]
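One plausible way to arrive at such a list is to count how often the router selects each expert on coding prompts versus general prompts, then keep the experts whose share of coding activations exceeds the threshold. The sketch below assumes that definition of "coding ratio" (this card does not spell it out) and uses made-up counts for hypothetical expert IDs.

```python
CODING_RATIO_THRESHOLD = 0.65  # threshold stated in this card


def coding_ratio(coding_hits, general_hits):
    """Share of an expert's activations that came from coding prompts."""
    total = coding_hits + general_hits
    return coding_hits / total if total else 0.0


def select_coding_experts(coding_counts, general_counts,
                          threshold=CODING_RATIO_THRESHOLD):
    """Return sorted expert IDs whose coding ratio exceeds the threshold."""
    experts = sorted(set(coding_counts) | set(general_counts))
    return [
        e for e in experts
        if coding_ratio(coding_counts.get(e, 0), general_counts.get(e, 0)) > threshold
    ]


# Made-up router activation counts per expert ID, for illustration only.
coding_counts = {0: 120, 1: 900, 2: 400, 3: 700}
general_counts = {0: 880, 1: 100, 2: 600, 3: 300}

selected = select_coding_experts(coding_counts, general_counts)
print(selected)  # [1, 3]
```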
Memory breakdown:
- FP16 components (router/embeddings): 1.2 GB
- Q8 coding experts: 2.1 GB
- Q4 other experts: 16.7 GB
- Total: 20 GB weights → 18 GB after MLX conversion
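The figures above can be cross-checked with a few lines of arithmetic. The FP16 baseline is an assumption here: roughly 30 billion parameters (taken from the model name) at 2 bytes each, with 1 GB counted as 10^9 bytes.

```python
# Memory breakdown from this card, in GB.
breakdown_gb = {
    "fp16_router_embeddings": 1.2,
    "q8_coding_experts": 2.1,
    "q4_other_experts": 16.7,
}
total_gb = sum(breakdown_gb.values())

# Assumption: ~30e9 parameters stored as FP16 (2 bytes per parameter).
ASSUMED_PARAMS = 30e9
fp16_baseline_gb = ASSUMED_PARAMS * 2 / 1e9

ratio = fp16_baseline_gb / total_gb
print(f"Total weights: {total_gb:.1f} GB")      # Total weights: 20.0 GB
print(f"Compression vs FP16: {ratio:.1f}x")     # Compression vs FP16: 3.0x
```

Under these assumptions the 3.0x compression ratio in the Performance table is measured against the 20 GB pre-conversion weights rather than the 18 GB MLX file.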
Limitations
- Apple Silicon only: requires the MLX framework
- English-centric profiling: coding experts were identified from English prompts
- Domain-specific: optimized for code/security, may underperform on creative writing
- Not benchmarked: no formal comparison vs uniform Q4 on CyberSecEval/HumanEval yet
Training & Profiling
- Base model: Qwen/Qwen3-30B-A3B
- Profiling: 200 coding/security prompts + 200 general prompts on A100 80GB
- Quantization: Selective Q8/Q4 applied via custom pipeline
- Conversion: Dequantized → MLX switch_mlp format → final Q4
Full pipeline: 4 scripts, ~$2 GPU rental, 72 minutes total.
Citation
@misc{qwen3-selective-quant-2026,
author = {Prasanna Kanagasabai},
title = {Qwen3-30B-A3B Selective Quantization for Security Workloads},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/PKSGIN/qwen3-30b-selective-quant}
}
License
Same as base model: Qwen License
Contact
Built by PK (@PKSGIN), CISO, Singapore. Email: prasanna.in@gmail.com