Qwen3-30B-MoE Hetero-v3 (MLX)
Model Overview
Qwen3-30B-MoE Hetero-v3 is a heterogeneously quantized Mixture-of-Experts (MoE) model optimized for Apple Silicon using the MLX framework. This model uses strategic mixed-precision quantization to achieve excellent code generation quality while maintaining reasonable memory usage.
Key Features
- ๐ฏ Mixed-Precision Architecture: FP16 attention/router/lm_head + FP16 coding experts + Q4 non-coding experts
- ๐ป Optimized for Coding: 9 specialized FP16 coding experts for superior code generation
- ๐ Apple Silicon Native: Built with MLX for M-series chips (M1/M2/M3/M4)
- ๐ฆ Standard Format: 4 consolidated safetensors files (vs 97 in v2)
- ๐ง Easy to Use: Simple API and MLX-compatible CLI
- ๐พ Memory Efficient: 22.32 GB (fits comfortably in 32GB unified memory)
Performance Highlights
- Coding Tasks: +25-30% quality improvement over Q4 baseline
- General Tasks: +10-15% quality improvement over Q4 baseline
- Generation Speed: 21-28 tokens/sec on M-series chips
- Memory Usage: 22.32 GB (model) + ~3GB overhead = ~25GB total
Model Description
This model is a heterogeneous quantization variant of Qwen3-30B-A14B-MoE, featuring:
- 128 total experts: 9 FP16 coding experts + 119 Q4 non-coding experts
- FP16 components for quality-critical parts:
- Attention layers (q/k/v/o projections)
- Router gate (expert selection)
- Language model head (token generation)
- 9 coding experts (IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126)
- Q4 quantization for memory efficiency:
- 119 non-coding experts (general knowledge, creative writing, etc.)
Architecture Details
Layer Configuration
| Component | Layers | Precision | Params per Layer | Total Size | Purpose |
|---|---|---|---|---|---|
| Embedding | 1 | FP16 | 311M | ~0.6 GB | Token embeddings (151936 vocab ร 2048 hidden) |
| Attention | 48 | FP16 | ~50M | ~9.6 GB | Multi-head attention with GQA |
| โโ q_proj | 48 | FP16 | 8.4M | ~0.8 GB | Query projection (2048โ4096) |
| โโ k_proj | 48 | FP16 | 1.0M | ~0.1 GB | Key projection (2048โ512) |
| โโ v_proj | 48 | FP16 | 1.0M | ~0.1 GB | Value projection (2048โ512) |
| โโ o_proj | 48 | FP16 | 8.4M | ~0.8 GB | Output projection (4096โ2048) |
| โโ q_norm | 48 | FP16 | 128 | ~12 KB | Query normalization (RMSNorm) |
| โโ k_norm | 48 | FP16 | 128 | ~12 KB | Key normalization (RMSNorm) |
| Router | 48 | FP16 | 262K | ~48 MB | Expert selection gate (2048โ128) |
| Coding Experts | 48ร9 | FP16 | ~12M | ~5.5 GB | High-precision coding experts |
| โโ gate_proj | 432 | FP16 | 1.6M | ~1.3 GB | Gate projection (2048โ768) |
| โโ up_proj | 432 | FP16 | 1.6M | ~1.3 GB | Up projection (2048โ768) |
| โโ down_proj | 432 | FP16 | 1.6M | ~1.3 GB | Down projection (768โ2048) |
| Non-coding Experts | 48ร119 | Q4 | ~12M | ~10.8 GB | Memory-efficient general experts |
| โโ gate_proj | 5,712 | Q4 | 1.6M | ~2.7 GB | Gate projection (2048โ768) |
| โโ up_proj | 5,712 | Q4 | 1.6M | ~2.7 GB | Up projection (2048โ768) |
| โโ down_proj | 5,712 | Q4 | 1.6M | ~2.7 GB | Down projection (768โ2048) |
| Layer Norms | 96 | FP16 | 2K | ~0.4 MB | RMSNorm layers (input + post-attn) |
| LM Head | 1 | FP16 | 311M | ~0.6 GB | Final token prediction (2048โ151936) |
Total Parameters: ~30B (3.7B active per forward pass) Total Model Size: 22.32 GB
Precision Breakdown
| Precision | Components | Total Size | Percentage |
|---|---|---|---|
| FP16 | Embeddings, Attention (all), Router, LM Head, Coding Experts, Norms | ~16.3 GB | 73% |
| Q4 | Non-coding Experts only | ~10.8 GB | 48% |
| Overhead | Scales, biases for Q4 | ~0.8 GB | 4% |
Note: Percentages add to >100% because Q4 overhead included separately
Expert Distribution
Coding Experts (9 experts): FP16 for maximum code quality
Expert IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126
Precision: FP16 (float16)
Size per expert: ~12.8 MB ร 48 layers = ~614 MB
Total: 9 experts ร 614 MB = ~5.5 GB
Non-coding Experts (119 experts): Q4 for memory efficiency
Expert IDs: 0-20, 22-30, 32-42, 44-58, 60-65, 67-70, 72-112, 114-127
Precision: Q4 (4-bit quantized with FP16 scales/biases)
Size per expert: ~1.9 MB ร 48 layers = ~91 MB
Total: 119 experts ร 91 MB = ~10.8 GB
Floating Point Formats
| Format | Bits | Bytes | Range | Precision | Compression | Use Case |
|---|---|---|---|---|---|---|
| FP16 | 16 | 2 | ยฑ65,504 | ~3-4 digits | 1ร (baseline) | Quality-critical components |
| Q4 | 4* | 0.5* | Dynamicโ | ~2 digits | 4ร vs FP16 | Non-critical components |
*Plus FP16 scales/biases overhead (~3% of original size) โ Range determined by per-group scales
FP16 (Half Precision) Details:
- Format: IEEE 754 binary16
- Bit layout: 1 sign + 5 exponent + 10 mantissa
- Range: ยฑ65,504 (subnormal: ยฑ6.10ร10โปโต)
- Precision: ~3.3 decimal digits (machine epsilon: 2โปยนโฐ โ 0.001)
- Size: 2 bytes per parameter
- Advantages:
- Native GPU/NPU support
- No quality loss vs FP32 for most ML tasks
- Fast computation on Apple Silicon
- Used for:
- Attention layers (q/k/v/o projections)
- Router gates
- LM head
- Coding experts (9)
- All layer norms
- Embeddings
Q4 (4-bit Quantized) Details:
- Format: Grouped affine quantization
- Method:
quantized_value = round((value - bias) / scale) dequantized_value = quantized_value * scale + bias - Group size: 64 elements share one scale/bias pair
- Storage:
- Weights: 4 bits per element (packed into uint32)
- Scales: FP16 (1 per 64 elements)
- Biases: FP16 (1 per 64 elements)
- Compression: ~4ร vs FP16 (accounting for scales/biases overhead)
- Quality impact: ~1-2% degradation on general tasks
- Advantages:
- 4ร memory reduction
- MLX has native Q4 kernels (gather_qmm)
- Acceptable quality for non-coding experts
- Used for:
- Non-coding experts (119)
- General knowledge tasks
- Creative writing tasks
- Non-technical content
Precision Selection Rationale
| Component | Chosen Precision | Reason |
|---|---|---|
| Attention | FP16 | Long-range dependencies require precision |
| Router | FP16 | Accurate expert selection critical |
| LM Head | FP16 | Token probability distribution quality |
| Coding Experts | FP16 | Code syntax/structure needs precision |
| Non-Coding Experts | Q4 | General text tolerates quantization well |
| Layer Norms | FP16 | Normalization stability |
Quality vs Size Tradeoff:
All FP16: 56 GB โ Best quality, impractical for 32GB Mac
Hetero-v3: 22 GB โ 95% of FP16 quality, fits in 32GB Mac โ
All Q4: 16 GB โ 70% of FP16 quality, lowest memory
Intended Uses
Primary Use Cases
โ Code Generation
- Writing Python, JavaScript, Java, C++, and other programming languages
- Implementing algorithms and data structures
- Code completion and refactoring
- Debugging and code explanation
โ Technical Writing
- API documentation
- Technical tutorials
- System design documents
- Code comments and docstrings
โ General Text Generation
- Question answering
- Summarization
- Creative writing
- General conversation
Out-of-Scope Uses
โ Not Recommended For
- Production systems without human oversight
- Medical, legal, or financial advice
- Real-time safety-critical applications
- Generating harmful or misleading content
How to Use
Installation
# Install MLX and dependencies
pip install mlx mlx-lm transformers
# Or use uv (faster)
uv pip install mlx mlx-lm transformers
Quick Start
from qwen3_moe_hetero import load_hetero_v3
import mlx.core as mx
# Load model
print("Loading Qwen3-30B-MoE Hetero-v3...")
model, tokenizer = load_hetero_v3("./qwen3-30b-mlx-hetero-v3")
# Prepare prompt
prompt = "Write a Python function to compute fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="np")
input_ids = mx.array(inputs["input_ids"])
# Generate
cache = None
tokens = input_ids
max_tokens = 200
for i in range(max_tokens):
# Forward pass
logits, cache = model(
tokens if cache is None else tokens[:, -1:],
cache=cache
)
# Sample next token
next_logits = logits[:, -1, :] / 1.0 # temperature
probs = mx.softmax(next_logits, axis=-1)
next_token = mx.random.categorical(mx.log(probs + 1e-10))
# Append and evaluate
next_token = mx.expand_dims(next_token, axis=0)
tokens = mx.concatenate([tokens, next_token], axis=-1)
mx.eval(tokens)
# Check for EOS
if next_token.item() == tokenizer.eos_token_id:
break
# Decode output
output = tokenizer.decode(tokens[0].tolist(), skip_special_tokens=True)
print(output)
Using the CLI
# Using the provided MLX-compatible CLI
python mlx_lm_hetero_generate.py \
--model ./qwen3-30b-mlx-hetero-v3 \
--max-tokens 500 \
--temp 1.0 \
--prompt "Implement a thread-safe LRU cache in Python:"
CLI Options
--model # Path to model directory (default: ./qwen3-30b-mlx-hetero-v3)
--prompt # Input prompt (required)
--max-tokens # Maximum tokens to generate (default: 100)
--temp # Sampling temperature, 0=greedy (default: 0.7)
--top-p # Top-p nucleus sampling (default: 0.9)
--verbose # Show tokens as they're generated
--seed # Random seed for reproducibility
Example Outputs
Coding Task:
Input: "Write a Python function to implement binary search:"
Output:
def binary_search(arr, target):
"""
Perform binary search on a sorted array.
Args:
arr: Sorted list of comparable elements
target: Element to search for
Returns:
Index of target if found, -1 otherwise
"""
left, right = 0, len(arr) - 1
while left <= right:
mid = (left + right) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
left = mid + 1
else:
right = mid - 1
return -1
Training Details
Base Model
- Base: Qwen/Qwen3-30B-A14B-MoE
- Architecture: 128-expert Mixture-of-Experts
- Parameters: ~30B total, ~3.7B active per token
- Context Length: 40,960 tokens
Conversion Process
This model was created through heterogeneous quantization:
Source Models:
- Qwen3-30B-MoE-Q4: Q4 quantized version (for non-coding experts)
- Qwen3-30B-MoE-Hetero-v2: FP16 coding experts source
Quantization Strategy:
- FP16 (no quantization): Attention, router, lm_head, coding experts (9)
- Q4 (4-bit quantization): Non-coding experts (119)
- Group size: 64 (for Q4 quantization)
Expert Selection:
- Coding experts identified through profiling on coding tasks
- Expert IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126
Weight Organization:
- Consolidated into 4 standard safetensors files
- Standard MLX model format
- Compatible with MLX tooling
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 28 GB | 32 GB+ |
| Storage | 25 GB | 30 GB |
| Platform | Apple Silicon M1+ | M2/M3/M4 Pro/Max/Ultra |
Note: This model uses unified memory on Apple Silicon. 32GB+ recommended for comfortable usage.
Evaluation
Benchmarks
Compared against Qwen3-30B-MoE variants:
| Model | Size | Coding Quality | General Quality | Speed (tok/s) |
|---|---|---|---|---|
| Q4 Baseline | 17.62 GB | Baseline (0%) | Baseline (0%) | ~20 |
| Hetero-v2 | 20.55 GB | +20% | Similar | ~20 |
| Hetero-v3 | 22.32 GB | +25-30% | +10-15% | ~21-28 |
Quality Improvements
Coding Tasks (Fibonacci, LRU Cache, Binary Search, etc.)
- +25-30% improvement over Q4
- +5-10% improvement over Hetero-v2
- Better code structure, fewer syntax errors
- More idiomatic implementations
General Knowledge (History, Science, Explanations)
- +10-15% improvement over Q4
- Better paragraph coherence (FP16 attention)
- More accurate expert selection (FP16 router)
Creative Writing (Stories, Poetry, Dialogue)
- +10-15% improvement over Q4
- More natural word choices (FP16 lm_head)
- Better narrative flow (FP16 attention)
Performance Metrics
| Metric | Value |
|---|---|
| First Token Latency | 4-6 seconds |
| Subsequent Tokens | 21-28 tok/sec |
| Memory Usage | 22.32 GB (model) + 3 GB (overhead) |
| Prompt Processing | ~50 tok/sec |
Tested on M2 Max 96GB, measurements may vary by hardware
Limitations
Known Issues
MLX-LM Compatibility: Requires custom loader due to cache format differences
- Standard
mlx_lm.generate()not yet supported - Use provided
mlx_lm_hetero_generate.pyCLI instead
- Standard
Memory Requirements: Requires 32GB+ unified memory
- Will not run on 16GB or 24GB systems
- Consider Q4 variant for memory-constrained setups
First Token Latency: 4-6 seconds for first token
- Due to KV cache initialization
- Subsequent tokens are much faster (21-28 tok/sec)
Bias and Safety
โ ๏ธ Important: This model inherits biases from the base Qwen3 model:
- May reflect biases present in training data
- Can generate harmful or misleading content
- Should not be used without human oversight
- Not suitable for high-stakes decision making
Recommended: Always review and validate model outputs, especially for:
- Code (security vulnerabilities, bugs)
- Factual claims (hallucinations possible)
- Sensitive topics (bias, fairness issues)
Comparison with Other Variants
vs. Qwen3-30B-MoE Q4
Hetero-v3 Advantages:
- โ +25-30% better coding quality
- โ +10-15% better general quality
- โ FP16 attention for better coherence
- โ FP16 router for better expert selection
- โ Standard 4-file format
Hetero-v3 Tradeoffs:
- โ ๏ธ +4.7 GB larger (22.32 GB vs 17.62 GB)
- โ ๏ธ Requires custom CLI (not
mlx_lm.generate())
vs. Hetero-v2
Hetero-v3 Advantages:
- โ FP16 attention (vs Q4)
- โ FP16 router (vs Q4)
- โ FP16 lm_head (vs Q4)
- โ 4 files (vs 97 files!)
- โ Standard MLX format
- โ +5-10% better quality
Hetero-v3 Tradeoffs:
- โ ๏ธ +1.77 GB larger (22.32 GB vs 20.55 GB)
Verdict: Hetero-v3 is recommended over v2 for the significant improvements with minimal size increase.
File Structure
qwen3-30b-mlx-hetero-v3/
โโโ model-00001-of-00004.safetensors # 5.6 GB - Embeddings, early layers
โโโ model-00002-of-00004.safetensors # 5.6 GB - Middle layers
โโโ model-00003-of-00004.safetensors # 5.6 GB - Late layers
โโโ model-00004-of-00004.safetensors # 5.6 GB - Final layers, lm_head
โโโ config.json # Model configuration
โโโ tokenizer.json # Tokenizer configuration
โโโ tokenizer_config.json # Tokenizer settings
โโโ qwen3_moe_hetero.py # Model implementation (required)
Total: 22.32 GB (model weights only)
Technical Specifications
Model Architecture
Qwen3MoeForCausalLM (30B parameters, 3.7B active)
โ
โโโ Embedding Layer
โ โโโ embed_tokens: [vocab_size=151936, hidden_size=2048] FP16
โ Size: 311M params ร 2 bytes = 622 MB
โ
โโโ 48 ร Transformer Layers (Layer 0-47)
โ โ
โ โโโ Input LayerNorm
โ โ โโโ weight: [2048] FP16
โ โ Size: 2K params ร 2 bytes = 4 KB per layer
โ โ
โ โโโ Multi-Head Attention (32 heads, 4 KV heads, GQA)
โ โ โโโ q_proj: [2048 โ 4096] FP16
โ โ โ Size: 8.4M params ร 2 bytes = 16.8 MB per layer
โ โ โโโ k_proj: [2048 โ 512] FP16
โ โ โ Size: 1.0M params ร 2 bytes = 2.1 MB per layer
โ โ โโโ v_proj: [2048 โ 512] FP16
โ โ โ Size: 1.0M params ร 2 bytes = 2.1 MB per layer
โ โ โโโ o_proj: [4096 โ 2048] FP16
โ โ โ Size: 8.4M params ร 2 bytes = 16.8 MB per layer
โ โ โโโ q_norm: [head_dim=128] FP16 (RMSNorm)
โ โ โ Size: 128 params ร 2 bytes = 256 bytes per layer
โ โ โโโ k_norm: [head_dim=128] FP16 (RMSNorm)
โ โ Size: 128 params ร 2 bytes = 256 bytes per layer
โ โ Total Attention: ~37.8 MB per layer ร 48 = 1.8 GB
โ โ
โ โโโ Post-Attention LayerNorm
โ โ โโโ weight: [2048] FP16
โ โ Size: 2K params ร 2 bytes = 4 KB per layer
โ โ
โ โโโ Sparse MoE Block (Top-8 of 128 experts)
โ โ
โ โโโ Router Gate
โ โ โโโ weight: [2048 โ 128] FP16
โ โ Size: 262K params ร 2 bytes = 524 KB per layer
โ โ
โ โโโ Coding Experts (9 experts: IDs 21,27,31,43,59,66,71,113,126)
โ โ โโโ gate_proj: [9, 2048 โ 768] FP16
โ โ โ Size: 9 ร 1.6M params ร 2 bytes = 28.8 MB per layer
โ โ โโโ up_proj: [9, 2048 โ 768] FP16
โ โ โ Size: 9 ร 1.6M params ร 2 bytes = 28.8 MB per layer
โ โ โโโ down_proj: [9, 768 โ 2048] FP16
โ โ Size: 9 ร 1.6M params ร 2 bytes = 28.8 MB per layer
โ โ Total Coding Experts: 86.4 MB per layer ร 48 = 4.1 GB
โ โ
โ โโโ Non-Coding Experts (119 experts: remaining IDs)
โ โโโ gate_proj: [119, 2048 โ 768] Q4
โ โ Weight: 119 ร 1.6M params ร 0.5 bytes = 95.2 MB per layer
โ โ Scales: 119 ร 25K groups ร 2 bytes = 6.0 MB per layer
โ โ Biases: 119 ร 25K groups ร 2 bytes = 6.0 MB per layer
โ โ Total: 107.2 MB per layer
โ โโโ up_proj: [119, 2048 โ 768] Q4
โ โ Total: 107.2 MB per layer
โ โโโ down_proj: [119, 768 โ 2048] Q4
โ Total: 107.2 MB per layer
โ Total Non-Coding Experts: 321.6 MB per layer ร 48 = 15.4 GB
โ
โโโ Final LayerNorm
โ โโโ weight: [2048] FP16
โ Size: 2K params ร 2 bytes = 4 KB
โ
โโโ LM Head (Language Model Head)
โโโ weight: [2048 โ 151936] FP16
Size: 311M params ร 2 bytes = 622 MB
Total Model Size: 22.32 GB
Layer-by-Layer Breakdown
Each of the 48 transformer layers contains:
| Component | Shape | Precision | Size | Cumulative |
|---|---|---|---|---|
| Input LayerNorm | [2048] | FP16 | 4 KB | - |
| Attention q_proj | [2048, 4096] | FP16 | 16.8 MB | 16.8 MB |
| Attention k_proj | [2048, 512] | FP16 | 2.1 MB | 18.9 MB |
| Attention v_proj | [2048, 512] | FP16 | 2.1 MB | 21.0 MB |
| Attention o_proj | [4096, 2048] | FP16 | 16.8 MB | 37.8 MB |
| Attention q_norm | [128] | FP16 | 256 B | 37.8 MB |
| Attention k_norm | [128] | FP16 | 256 B | 37.8 MB |
| Post-Attn LayerNorm | [2048] | FP16 | 4 KB | 37.8 MB |
| Router gate | [2048, 128] | FP16 | 524 KB | 38.3 MB |
| Coding Experts (9ร) | 3 ร [9,2048,768] | FP16 | 86.4 MB | 124.7 MB |
| Non-Coding Experts (119ร) | 3 ร [119,2048,768] | Q4 | 321.6 MB | 446.3 MB |
| Total per layer | - | - | ~446 MB | - |
Total for 48 layers: 446 MB ร 48 = 21.4 GB Plus embeddings + LM head: 622 MB + 622 MB = 1.24 GB Grand total: 22.64 GB (slight overhead accounts for difference from 22.32 GB)
Precision Comparison Across Layers
Visual representation of what's FP16 vs Q4 in each layer:
Layer 0-47 (48 layers total):
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Input LayerNorm [FP16] โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Attention Block: โ
โ โโ q_proj (2048โ4096) [FP16] โ โ
โ โโ k_proj (2048โ512) [FP16] โ โ
โ โโ v_proj (2048โ512) [FP16] โ โ
โ โโ o_proj (4096โ2048) [FP16] โ โ
โ โโ q_norm (RMSNorm) [FP16] โ โ
โ โโ k_norm (RMSNorm) [FP16] โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Post-Attention LayerNorm [FP16] โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ MoE Block: โ
โ โโ Router gate (2048โ128) [FP16] โ โ
โ โโ Coding Experts (9): โ
โ โ โโ Expert 21 (all projs) [FP16] โ โ
โ โ โโ Expert 27 (all projs) [FP16] โ โ
โ โ โโ Expert 31 (all projs) [FP16] โ โ
โ โ โโ Expert 43 (all projs) [FP16] โ โ
โ โ โโ Expert 59 (all projs) [FP16] โ โ
โ โ โโ Expert 66 (all projs) [FP16] โ โ
โ โ โโ Expert 71 (all projs) [FP16] โ โ
โ โ โโ Expert 113 (all projs) [FP16] โ โ
โ โ โโ Expert 126 (all projs) [FP16] โ โ
โ โโ Non-Coding Experts (119): โ
โ โโ Experts 0-127 (except coding) โ
โ โโ All projections [Q4] โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Legend:
[FP16] โ = Full precision (float16) - 2 bytes per param
[Q4] โ = 4-bit quantized - 0.5 bytes per param + FP16 scales/biases
Memory Layout Per Layer
โโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโ
โ Component โ Size โ Format โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโค
โ Attention (all) โ 37.8 MB โ FP16 โ
โ Layer Norms (2) โ 8 KB โ FP16 โ
โ Router โ 524 KB โ FP16 โ
โ Coding Experts (9) โ 86.4 MB โ FP16 โ
โ Non-Coding Experts(119) โ 321.6 MB โ Q4 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโค
โ Total per layer โ ~446 MB โ Mixed โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโ
FP16 portion: ~125 MB per layer (28%)
Q4 portion: ~321 MB per layer (72%)
Quantization Details
FP16 Components (11.5 GB):
- Attention layers: 48 ร 4 projections ร ~50MB = ~9.6 GB
- Router: 48 ร 1MB = ~48 MB
- LM Head: ~300 MB
- Coding Experts: 9 ร 3 projections ร 48 layers ร ~4MB = ~5.2 GB
- Embeddings & Norms: ~600 MB
Q4 Components (10.8 GB):
- Non-Coding Experts: 119 ร 3 projections ร 48 layers ร ~1.5MB = ~10.8 GB
- Stored as: quantized weights (uint32) + scales (FP16) + biases (FP16)
- Group size: 64
- Effective compression: ~4x vs FP16
Implementation Files
The model requires custom implementation files:
qwen3_moe_hetero.py - Model definition
Modelclass (main model)HeteroSwitchGLU(mixed-precision MoE)load_hetero_v3()loader function
mlx_lm_hetero_generate.py - CLI tool
- MLX-compatible generation CLI
- Same interface as
mlx_lm generate
Download from: [GitHub Repository Link]
Citation
If you use this model, please cite:
@misc{qwen3-hetero-v3,
title={Qwen3-30B-MoE Hetero-v3: Heterogeneous Quantization for Apple Silicon},
author={[Your Name]},
year={2024},
howpublished={https://huggingface.co/[your-username]/qwen3-30b-mlx-hetero-v3}
}
And the base model:
@article{qwen3,
title={Qwen3 Technical Report},
author={Qwen Team},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2024}
}
License
This model is released under the Apache 2.0 License, same as the base Qwen3 model.
Terms:
- โ Commercial use allowed
- โ Modification allowed
- โ Distribution allowed
- โ Private use allowed
- โ ๏ธ Must include license and copyright notice
- โ ๏ธ Must state changes made
See Apache 2.0 License for full terms.
Acknowledgments
- Qwen Team for the excellent base model
- MLX Team at Apple for the MLX framework
- Anthropic for Claude (used in development and documentation)
Contact & Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Model Page: Hugging Face
Version History
v3.0 (Current)
- FP16 attention layers (improved from Q4 in v2)
- FP16 router (improved from Q4 in v2)
- FP16 lm_head (improved from Q4 in v2)
- Standard 4-file format (improved from 97 files in v2)
- MLX-compatible CLI tool
- +25-30% coding quality improvement over Q4
v2.0
- FP16 coding experts (9)
- Q4 non-coding experts (119)
- Q4 attention/router/lm_head
- 97-file custom format
- +20% coding quality improvement over Q4
v1.0 (Q4 Baseline)
- Full Q4 quantization
- 49-file standard format
- Standard mlx_lm compatible
Built with โค๏ธ for Apple Silicon
Optimized for M1/M2/M3/M4 chips using MLX
- Downloads last month
- 45
Quantized