Instructions to use PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Notebooks
Google Colab
Kaggle
Local Apps
LM Studio

How to use PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx

Run Hermes

hermes

MLX LM

How to use PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "PKSGIN/qwen3-30b-selective-quant-MixedMPW-mlx",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'

Qwen3-30B-MoE Hetero-v3 (MLX)

Model Overview

Qwen3-30B-MoE Hetero-v3 is a heterogeneously quantized Mixture-of-Experts (MoE) model optimized for Apple Silicon using the MLX framework. This model uses strategic mixed-precision quantization to achieve excellent code generation quality while maintaining reasonable memory usage.

Key Features

🎯 Mixed-Precision Architecture: FP16 attention/router/lm_head + FP16 coding experts + Q4 non-coding experts
💻 Optimized for Coding: 9 specialized FP16 coding experts for superior code generation
🚀 Apple Silicon Native: Built with MLX for M-series chips (M1/M2/M3/M4)
📦 Standard Format: 4 consolidated safetensors files (vs 97 in v2)
🔧 Easy to Use: Simple API and MLX-compatible CLI
💾 Memory Efficient: 22.32 GB (fits comfortably in 32GB unified memory)

Performance Highlights

Coding Tasks: +25-30% quality improvement over Q4 baseline
General Tasks: +10-15% quality improvement over Q4 baseline
Generation Speed: 21-28 tokens/sec on M-series chips
Memory Usage: 22.32 GB (model) + ~3GB overhead = ~25GB total

Model Description

This model is a heterogeneous quantization variant of Qwen3-30B-A14B-MoE, featuring:

128 total experts: 9 FP16 coding experts + 119 Q4 non-coding experts
FP16 components for quality-critical parts:
- Attention layers (q/k/v/o projections)
- Router gate (expert selection)
- Language model head (token generation)
- 9 coding experts (IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126)
Q4 quantization for memory efficiency:
- 119 non-coding experts (general knowledge, creative writing, etc.)

Architecture Details

Layer Configuration

Component	Layers	Precision	Params per Layer	Total Size	Purpose
Embedding	1	FP16	311M	~0.6 GB	Token embeddings (151936 vocab × 2048 hidden)
Attention	48	FP16	~50M	~9.6 GB	Multi-head attention with GQA
├─ q_proj	48	FP16	8.4M	~0.8 GB	Query projection (2048→4096)
├─ k_proj	48	FP16	1.0M	~0.1 GB	Key projection (2048→512)
├─ v_proj	48	FP16	1.0M	~0.1 GB	Value projection (2048→512)
├─ o_proj	48	FP16	8.4M	~0.8 GB	Output projection (4096→2048)
├─ q_norm	48	FP16	128	~12 KB	Query normalization (RMSNorm)
└─ k_norm	48	FP16	128	~12 KB	Key normalization (RMSNorm)
Router	48	FP16	262K	~48 MB	Expert selection gate (2048→128)
Coding Experts	48×9	FP16	~12M	~5.5 GB	High-precision coding experts
├─ gate_proj	432	FP16	1.6M	~1.3 GB	Gate projection (2048→768)
├─ up_proj	432	FP16	1.6M	~1.3 GB	Up projection (2048→768)
└─ down_proj	432	FP16	1.6M	~1.3 GB	Down projection (768→2048)
Non-coding Experts	48×119	Q4	~12M	~10.8 GB	Memory-efficient general experts
├─ gate_proj	5,712	Q4	1.6M	~2.7 GB	Gate projection (2048→768)
├─ up_proj	5,712	Q4	1.6M	~2.7 GB	Up projection (2048→768)
└─ down_proj	5,712	Q4	1.6M	~2.7 GB	Down projection (768→2048)
Layer Norms	96	FP16	2K	~0.4 MB	RMSNorm layers (input + post-attn)
LM Head	1	FP16	311M	~0.6 GB	Final token prediction (2048→151936)

Total Parameters: ~30B (3.7B active per forward pass) Total Model Size: 22.32 GB

Precision Breakdown

Precision	Components	Total Size	Percentage
FP16	Embeddings, Attention (all), Router, LM Head, Coding Experts, Norms	~16.3 GB	73%
Q4	Non-coding Experts only	~10.8 GB	48%
Overhead	Scales, biases for Q4	~0.8 GB	4%

Note: Percentages add to >100% because Q4 overhead included separately

Expert Distribution

Coding Experts (9 experts): FP16 for maximum code quality

Expert IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126
Precision: FP16 (float16)
Size per expert: ~12.8 MB × 48 layers = ~614 MB
Total: 9 experts × 614 MB = ~5.5 GB

Non-coding Experts (119 experts): Q4 for memory efficiency

Expert IDs: 0-20, 22-30, 32-42, 44-58, 60-65, 67-70, 72-112, 114-127
Precision: Q4 (4-bit quantized with FP16 scales/biases)
Size per expert: ~1.9 MB × 48 layers = ~91 MB
Total: 119 experts × 91 MB = ~10.8 GB

Floating Point Formats

Format	Bits	Bytes	Range	Precision	Compression	Use Case
FP16	16	2	±65,504	~3-4 digits	1× (baseline)	Quality-critical components
Q4	4*	0.5*	Dynamic†	~2 digits	4× vs FP16	Non-critical components

*Plus FP16 scales/biases overhead (~3% of original size) †Range determined by per-group scales

FP16 (Half Precision) Details:

Format: IEEE 754 binary16
Bit layout: 1 sign + 5 exponent + 10 mantissa
Range: ±65,504 (subnormal: ±6.10×10⁻⁵)
Precision: ~3.3 decimal digits (machine epsilon: 2⁻¹⁰ ≈ 0.001)
Size: 2 bytes per parameter
Advantages:
- Native GPU/NPU support
- No quality loss vs FP32 for most ML tasks
- Fast computation on Apple Silicon
Used for:
- Attention layers (q/k/v/o projections)
- Router gates
- LM head
- Coding experts (9)
- All layer norms
- Embeddings

Q4 (4-bit Quantized) Details:

Format: Grouped affine quantization

Method:

quantized_value = round((value - bias) / scale)
dequantized_value = quantized_value * scale + bias

Group size: 64 elements share one scale/bias pair
Storage:
- Weights: 4 bits per element (packed into uint32)
- Scales: FP16 (1 per 64 elements)
- Biases: FP16 (1 per 64 elements)
Compression: ~4× vs FP16 (accounting for scales/biases overhead)
Quality impact: ~1-2% degradation on general tasks
Advantages:
- 4× memory reduction
- MLX has native Q4 kernels (gather_qmm)
- Acceptable quality for non-coding experts
Used for:
- Non-coding experts (119)
- General knowledge tasks
- Creative writing tasks
- Non-technical content

Precision Selection Rationale

Component	Chosen Precision	Reason
Attention	FP16	Long-range dependencies require precision
Router	FP16	Accurate expert selection critical
LM Head	FP16	Token probability distribution quality
Coding Experts	FP16	Code syntax/structure needs precision
Non-Coding Experts	Q4	General text tolerates quantization well
Layer Norms	FP16	Normalization stability

Quality vs Size Tradeoff:

All FP16:  56 GB → Best quality, impractical for 32GB Mac
Hetero-v3: 22 GB → 95% of FP16 quality, fits in 32GB Mac ✓
All Q4:    16 GB → 70% of FP16 quality, lowest memory

Intended Uses

Primary Use Cases

✅ Code Generation

Writing Python, JavaScript, Java, C++, and other programming languages
Implementing algorithms and data structures
Code completion and refactoring
Debugging and code explanation

✅ Technical Writing

API documentation
Technical tutorials
System design documents
Code comments and docstrings

✅ General Text Generation

Question answering
Summarization
Creative writing
General conversation

Out-of-Scope Uses

❌ Not Recommended For

Production systems without human oversight
Medical, legal, or financial advice
Real-time safety-critical applications
Generating harmful or misleading content

How to Use

Installation

# Install MLX and dependencies
pip install mlx mlx-lm transformers

# Or use uv (faster)
uv pip install mlx mlx-lm transformers

Quick Start

from qwen3_moe_hetero import load_hetero_v3
import mlx.core as mx

# Load model
print("Loading Qwen3-30B-MoE Hetero-v3...")
model, tokenizer = load_hetero_v3("./qwen3-30b-mlx-hetero-v3")

# Prepare prompt
prompt = "Write a Python function to compute fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="np")
input_ids = mx.array(inputs["input_ids"])

# Generate
cache = None
tokens = input_ids
max_tokens = 200

for i in range(max_tokens):
    # Forward pass
    logits, cache = model(
        tokens if cache is None else tokens[:, -1:],
        cache=cache
    )

    # Sample next token
    next_logits = logits[:, -1, :] / 1.0  # temperature
    probs = mx.softmax(next_logits, axis=-1)
    next_token = mx.random.categorical(mx.log(probs + 1e-10))

    # Append and evaluate
    next_token = mx.expand_dims(next_token, axis=0)
    tokens = mx.concatenate([tokens, next_token], axis=-1)
    mx.eval(tokens)

    # Check for EOS
    if next_token.item() == tokenizer.eos_token_id:
        break

# Decode output
output = tokenizer.decode(tokens[0].tolist(), skip_special_tokens=True)
print(output)

Using the CLI

# Using the provided MLX-compatible CLI
python mlx_lm_hetero_generate.py \
    --model ./qwen3-30b-mlx-hetero-v3 \
    --max-tokens 500 \
    --temp 1.0 \
    --prompt "Implement a thread-safe LRU cache in Python:"

CLI Options

--model          # Path to model directory (default: ./qwen3-30b-mlx-hetero-v3)
--prompt         # Input prompt (required)
--max-tokens     # Maximum tokens to generate (default: 100)
--temp           # Sampling temperature, 0=greedy (default: 0.7)
--top-p          # Top-p nucleus sampling (default: 0.9)
--verbose        # Show tokens as they're generated
--seed           # Random seed for reproducibility

Example Outputs

Coding Task:

Input: "Write a Python function to implement binary search:"

Output:
def binary_search(arr, target):
    """
    Perform binary search on a sorted array.

    Args:
        arr: Sorted list of comparable elements
        target: Element to search for

    Returns:
        Index of target if found, -1 otherwise
    """
    left, right = 0, len(arr) - 1

    while left <= right:
        mid = (left + right) // 2

        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return -1

Training Details

Base Model

Base: Qwen/Qwen3-30B-A14B-MoE
Architecture: 128-expert Mixture-of-Experts
Parameters: ~30B total, ~3.7B active per token
Context Length: 40,960 tokens

Conversion Process

This model was created through heterogeneous quantization:

Source Models:
- Qwen3-30B-MoE-Q4: Q4 quantized version (for non-coding experts)
- Qwen3-30B-MoE-Hetero-v2: FP16 coding experts source
Quantization Strategy:
- FP16 (no quantization): Attention, router, lm_head, coding experts (9)
- Q4 (4-bit quantization): Non-coding experts (119)
- Group size: 64 (for Q4 quantization)
Expert Selection:
- Coding experts identified through profiling on coding tasks
- Expert IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126
Weight Organization:
- Consolidated into 4 standard safetensors files
- Standard MLX model format
- Compatible with MLX tooling

Hardware Requirements

Component	Minimum	Recommended
RAM	28 GB	32 GB+
Storage	25 GB	30 GB
Platform	Apple Silicon M1+	M2/M3/M4 Pro/Max/Ultra

Note: This model uses unified memory on Apple Silicon. 32GB+ recommended for comfortable usage.

Evaluation

Benchmarks

Compared against Qwen3-30B-MoE variants:

Model	Size	Coding Quality	General Quality	Speed (tok/s)
Q4 Baseline	17.62 GB	Baseline (0%)	Baseline (0%)	~20
Hetero-v2	20.55 GB	+20%	Similar	~20
Hetero-v3	22.32 GB	+25-30%	+10-15%	~21-28

Quality Improvements

Coding Tasks (Fibonacci, LRU Cache, Binary Search, etc.)

+25-30% improvement over Q4
+5-10% improvement over Hetero-v2
Better code structure, fewer syntax errors
More idiomatic implementations

General Knowledge (History, Science, Explanations)

+10-15% improvement over Q4
Better paragraph coherence (FP16 attention)
More accurate expert selection (FP16 router)

Creative Writing (Stories, Poetry, Dialogue)

+10-15% improvement over Q4
More natural word choices (FP16 lm_head)
Better narrative flow (FP16 attention)

Performance Metrics

Metric	Value
First Token Latency	4-6 seconds
Subsequent Tokens	21-28 tok/sec
Memory Usage	22.32 GB (model) + 3 GB (overhead)
Prompt Processing	~50 tok/sec

Tested on M2 Max 96GB, measurements may vary by hardware

Limitations

Known Issues

MLX-LM Compatibility: Requires custom loader due to cache format differences
- Standard mlx_lm.generate() not yet supported
- Use provided mlx_lm_hetero_generate.py CLI instead
Memory Requirements: Requires 32GB+ unified memory
- Will not run on 16GB or 24GB systems
- Consider Q4 variant for memory-constrained setups
First Token Latency: 4-6 seconds for first token
- Due to KV cache initialization
- Subsequent tokens are much faster (21-28 tok/sec)

Bias and Safety

⚠️ Important: This model inherits biases from the base Qwen3 model:

May reflect biases present in training data
Can generate harmful or misleading content
Should not be used without human oversight
Not suitable for high-stakes decision making

Recommended: Always review and validate model outputs, especially for:

Code (security vulnerabilities, bugs)
Factual claims (hallucinations possible)
Sensitive topics (bias, fairness issues)

Comparison with Other Variants

vs. Qwen3-30B-MoE Q4

Hetero-v3 Advantages:

✅ +25-30% better coding quality
✅ +10-15% better general quality
✅ FP16 attention for better coherence
✅ FP16 router for better expert selection
✅ Standard 4-file format

Hetero-v3 Tradeoffs:

⚠️ +4.7 GB larger (22.32 GB vs 17.62 GB)
⚠️ Requires custom CLI (not mlx_lm.generate())

vs. Hetero-v2

Hetero-v3 Advantages:

✅ FP16 attention (vs Q4)
✅ FP16 router (vs Q4)
✅ FP16 lm_head (vs Q4)
✅ 4 files (vs 97 files!)
✅ Standard MLX format
✅ +5-10% better quality

Hetero-v3 Tradeoffs:

⚠️ +1.77 GB larger (22.32 GB vs 20.55 GB)

Verdict: Hetero-v3 is recommended over v2 for the significant improvements with minimal size increase.

File Structure

qwen3-30b-mlx-hetero-v3/
├── model-00001-of-00004.safetensors  # 5.6 GB - Embeddings, early layers
├── model-00002-of-00004.safetensors  # 5.6 GB - Middle layers
├── model-00003-of-00004.safetensors  # 5.6 GB - Late layers
├── model-00004-of-00004.safetensors  # 5.6 GB - Final layers, lm_head
├── config.json                        # Model configuration
├── tokenizer.json                     # Tokenizer configuration
├── tokenizer_config.json             # Tokenizer settings
└── qwen3_moe_hetero.py               # Model implementation (required)

Total: 22.32 GB (model weights only)

Technical Specifications

Model Architecture

Qwen3MoeForCausalLM (30B parameters, 3.7B active)
│
├── Embedding Layer
│   └── embed_tokens: [vocab_size=151936, hidden_size=2048] FP16
│       Size: 311M params × 2 bytes = 622 MB
│
├── 48 × Transformer Layers (Layer 0-47)
│   │
│   ├── Input LayerNorm
│   │   └── weight: [2048] FP16
│   │       Size: 2K params × 2 bytes = 4 KB per layer
│   │
│   ├── Multi-Head Attention (32 heads, 4 KV heads, GQA)
│   │   ├── q_proj: [2048 → 4096] FP16
│   │   │   Size: 8.4M params × 2 bytes = 16.8 MB per layer
│   │   ├── k_proj: [2048 → 512] FP16
│   │   │   Size: 1.0M params × 2 bytes = 2.1 MB per layer
│   │   ├── v_proj: [2048 → 512] FP16
│   │   │   Size: 1.0M params × 2 bytes = 2.1 MB per layer
│   │   ├── o_proj: [4096 → 2048] FP16
│   │   │   Size: 8.4M params × 2 bytes = 16.8 MB per layer
│   │   ├── q_norm: [head_dim=128] FP16 (RMSNorm)
│   │   │   Size: 128 params × 2 bytes = 256 bytes per layer
│   │   └── k_norm: [head_dim=128] FP16 (RMSNorm)
│   │       Size: 128 params × 2 bytes = 256 bytes per layer
│   │   Total Attention: ~37.8 MB per layer × 48 = 1.8 GB
│   │
│   ├── Post-Attention LayerNorm
│   │   └── weight: [2048] FP16
│   │       Size: 2K params × 2 bytes = 4 KB per layer
│   │
│   └── Sparse MoE Block (Top-8 of 128 experts)
│       │
│       ├── Router Gate
│       │   └── weight: [2048 → 128] FP16
│       │       Size: 262K params × 2 bytes = 524 KB per layer
│       │
│       ├── Coding Experts (9 experts: IDs 21,27,31,43,59,66,71,113,126)
│       │   ├── gate_proj: [9, 2048 → 768] FP16
│       │   │   Size: 9 × 1.6M params × 2 bytes = 28.8 MB per layer
│       │   ├── up_proj: [9, 2048 → 768] FP16
│       │   │   Size: 9 × 1.6M params × 2 bytes = 28.8 MB per layer
│       │   └── down_proj: [9, 768 → 2048] FP16
│       │       Size: 9 × 1.6M params × 2 bytes = 28.8 MB per layer
│       │   Total Coding Experts: 86.4 MB per layer × 48 = 4.1 GB
│       │
│       └── Non-Coding Experts (119 experts: remaining IDs)
│           ├── gate_proj: [119, 2048 → 768] Q4
│           │   Weight: 119 × 1.6M params × 0.5 bytes = 95.2 MB per layer
│           │   Scales: 119 × 25K groups × 2 bytes = 6.0 MB per layer
│           │   Biases: 119 × 25K groups × 2 bytes = 6.0 MB per layer
│           │   Total: 107.2 MB per layer
│           ├── up_proj: [119, 2048 → 768] Q4
│           │   Total: 107.2 MB per layer
│           └── down_proj: [119, 768 → 2048] Q4
│               Total: 107.2 MB per layer
│           Total Non-Coding Experts: 321.6 MB per layer × 48 = 15.4 GB
│
├── Final LayerNorm
│   └── weight: [2048] FP16
│       Size: 2K params × 2 bytes = 4 KB
│
└── LM Head (Language Model Head)
    └── weight: [2048 → 151936] FP16
        Size: 311M params × 2 bytes = 622 MB

Total Model Size: 22.32 GB

Layer-by-Layer Breakdown

Each of the 48 transformer layers contains:

Component	Shape	Precision	Size	Cumulative
Input LayerNorm	[2048]	FP16	4 KB	-
Attention q_proj	[2048, 4096]	FP16	16.8 MB	16.8 MB
Attention k_proj	[2048, 512]	FP16	2.1 MB	18.9 MB
Attention v_proj	[2048, 512]	FP16	2.1 MB	21.0 MB
Attention o_proj	[4096, 2048]	FP16	16.8 MB	37.8 MB
Attention q_norm	[128]	FP16	256 B	37.8 MB
Attention k_norm	[128]	FP16	256 B	37.8 MB
Post-Attn LayerNorm	[2048]	FP16	4 KB	37.8 MB
Router gate	[2048, 128]	FP16	524 KB	38.3 MB
Coding Experts (9×)	3 × [9,2048,768]	FP16	86.4 MB	124.7 MB
Non-Coding Experts (119×)	3 × [119,2048,768]	Q4	321.6 MB	446.3 MB
Total per layer	-	-	~446 MB	-

Total for 48 layers: 446 MB × 48 = 21.4 GB Plus embeddings + LM head: 622 MB + 622 MB = 1.24 GB Grand total: 22.64 GB (slight overhead accounts for difference from 22.32 GB)

Precision Comparison Across Layers

Visual representation of what's FP16 vs Q4 in each layer:

Layer 0-47 (48 layers total):
┌─────────────────────────────────────────────┐
│ Input LayerNorm                   [FP16]    │
├─────────────────────────────────────────────┤
│ Attention Block:                            │
│  ├─ q_proj (2048→4096)           [FP16] ✓  │
│  ├─ k_proj (2048→512)            [FP16] ✓  │
│  ├─ v_proj (2048→512)            [FP16] ✓  │
│  ├─ o_proj (4096→2048)           [FP16] ✓  │
│  ├─ q_norm (RMSNorm)             [FP16] ✓  │
│  └─ k_norm (RMSNorm)             [FP16] ✓  │
├─────────────────────────────────────────────┤
│ Post-Attention LayerNorm          [FP16]    │
├─────────────────────────────────────────────┤
│ MoE Block:                                  │
│  ├─ Router gate (2048→128)       [FP16] ✓  │
│  ├─ Coding Experts (9):                     │
│  │   ├─ Expert 21 (all projs)   [FP16] ✓  │
│  │   ├─ Expert 27 (all projs)   [FP16] ✓  │
│  │   ├─ Expert 31 (all projs)   [FP16] ✓  │
│  │   ├─ Expert 43 (all projs)   [FP16] ✓  │
│  │   ├─ Expert 59 (all projs)   [FP16] ✓  │
│  │   ├─ Expert 66 (all projs)   [FP16] ✓  │
│  │   ├─ Expert 71 (all projs)   [FP16] ✓  │
│  │   ├─ Expert 113 (all projs)  [FP16] ✓  │
│  │   └─ Expert 126 (all projs)  [FP16] ✓  │
│  └─ Non-Coding Experts (119):              │
│      └─ Experts 0-127 (except coding)      │
│         └─ All projections        [Q4]  ◆  │
└─────────────────────────────────────────────┘

Legend:
  [FP16] ✓ = Full precision (float16) - 2 bytes per param
  [Q4]   ◆ = 4-bit quantized - 0.5 bytes per param + FP16 scales/biases

Memory Layout Per Layer

┌─────────────────────────┬──────────┬──────────┐
│      Component          │   Size   │  Format  │
├─────────────────────────┼──────────┼──────────┤
│ Attention (all)         │  37.8 MB │  FP16    │
│ Layer Norms (2)         │   8 KB   │  FP16    │
│ Router                  │  524 KB  │  FP16    │
│ Coding Experts (9)      │  86.4 MB │  FP16    │
│ Non-Coding Experts(119) │ 321.6 MB │  Q4      │
├─────────────────────────┼──────────┼──────────┤
│ Total per layer         │ ~446 MB  │  Mixed   │
└─────────────────────────┴──────────┴──────────┘

FP16 portion: ~125 MB per layer (28%)
Q4 portion:   ~321 MB per layer (72%)

Quantization Details

FP16 Components (11.5 GB):

Attention layers: 48 × 4 projections × ~50MB = ~9.6 GB
Router: 48 × 1MB = ~48 MB
LM Head: ~300 MB
Coding Experts: 9 × 3 projections × 48 layers × ~4MB = ~5.2 GB
Embeddings & Norms: ~600 MB

Q4 Components (10.8 GB):

Non-Coding Experts: 119 × 3 projections × 48 layers × ~1.5MB = ~10.8 GB
- Stored as: quantized weights (uint32) + scales (FP16) + biases (FP16)
- Group size: 64
- Effective compression: ~4x vs FP16

Implementation Files

The model requires custom implementation files:

qwen3_moe_hetero.py - Model definition
- Model class (main model)
- HeteroSwitchGLU (mixed-precision MoE)
- load_hetero_v3() loader function
mlx_lm_hetero_generate.py - CLI tool
- MLX-compatible generation CLI
- Same interface as mlx_lm generate

Download from: [GitHub Repository Link]

Citation

If you use this model, please cite:

@misc{qwen3-hetero-v3,
  title={Qwen3-30B-MoE Hetero-v3: Heterogeneous Quantization for Apple Silicon},
  author={[Your Name]},
  year={2024},
  howpublished={https://huggingface.co/[your-username]/qwen3-30b-mlx-hetero-v3}
}

And the base model:

@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}

License

This model is released under the Apache 2.0 License, same as the base Qwen3 model.

Terms:

✅ Commercial use allowed
✅ Modification allowed
✅ Distribution allowed
✅ Private use allowed
⚠️ Must include license and copyright notice
⚠️ Must state changes made

See Apache 2.0 License for full terms.

Acknowledgments

Qwen Team for the excellent base model
MLX Team at Apple for the MLX framework
Anthropic for Claude (used in development and documentation)

Contact & Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Model Page: Hugging Face

Version History

v3.0 (Current)

FP16 attention layers (improved from Q4 in v2)
FP16 router (improved from Q4 in v2)
FP16 lm_head (improved from Q4 in v2)
Standard 4-file format (improved from 97 files in v2)
MLX-compatible CLI tool
+25-30% coding quality improvement over Q4

v2.0

FP16 coding experts (9)
Q4 non-coding experts (119)
Q4 attention/router/lm_head
97-file custom format
+20% coding quality improvement over Q4

v1.0 (Q4 Baseline)

Full Q4 quantization
49-file standard format
Standard mlx_lm compatible

Built with ❤️ for Apple Silicon
Optimized for M1/M2/M3/M4 chips using MLX

Downloads last month: 12

MLX

Hardware compatibility

Quantized