Qwen3-30B-MoE Hetero-v3 (MLX)

Model Overview

Qwen3-30B-MoE Hetero-v3 is a heterogeneously quantized Mixture-of-Experts (MoE) model optimized for Apple Silicon using the MLX framework. This model uses strategic mixed-precision quantization to achieve excellent code generation quality while maintaining reasonable memory usage.

Key Features

  • ๐ŸŽฏ Mixed-Precision Architecture: FP16 attention/router/lm_head + FP16 coding experts + Q4 non-coding experts
  • ๐Ÿ’ป Optimized for Coding: 9 specialized FP16 coding experts for superior code generation
  • ๐Ÿš€ Apple Silicon Native: Built with MLX for M-series chips (M1/M2/M3/M4)
  • ๐Ÿ“ฆ Standard Format: 4 consolidated safetensors files (vs 97 in v2)
  • ๐Ÿ”ง Easy to Use: Simple API and MLX-compatible CLI
  • ๐Ÿ’พ Memory Efficient: 22.32 GB (fits comfortably in 32GB unified memory)

Performance Highlights

  • Coding Tasks: +25-30% quality improvement over Q4 baseline
  • General Tasks: +10-15% quality improvement over Q4 baseline
  • Generation Speed: 21-28 tokens/sec on M-series chips
  • Memory Usage: 22.32 GB (model) + ~3GB overhead = ~25GB total

Model Description

This model is a heterogeneous quantization variant of Qwen3-30B-A14B-MoE, featuring:

  • 128 total experts: 9 FP16 coding experts + 119 Q4 non-coding experts
  • FP16 components for quality-critical parts:
    • Attention layers (q/k/v/o projections)
    • Router gate (expert selection)
    • Language model head (token generation)
    • 9 coding experts (IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126)
  • Q4 quantization for memory efficiency:
    • 119 non-coding experts (general knowledge, creative writing, etc.)

Architecture Details

Layer Configuration

Component Layers Precision Params per Layer Total Size Purpose
Embedding 1 FP16 311M ~0.6 GB Token embeddings (151936 vocab ร— 2048 hidden)
Attention 48 FP16 ~50M ~9.6 GB Multi-head attention with GQA
โ”œโ”€ q_proj 48 FP16 8.4M ~0.8 GB Query projection (2048โ†’4096)
โ”œโ”€ k_proj 48 FP16 1.0M ~0.1 GB Key projection (2048โ†’512)
โ”œโ”€ v_proj 48 FP16 1.0M ~0.1 GB Value projection (2048โ†’512)
โ”œโ”€ o_proj 48 FP16 8.4M ~0.8 GB Output projection (4096โ†’2048)
โ”œโ”€ q_norm 48 FP16 128 ~12 KB Query normalization (RMSNorm)
โ””โ”€ k_norm 48 FP16 128 ~12 KB Key normalization (RMSNorm)
Router 48 FP16 262K ~48 MB Expert selection gate (2048โ†’128)
Coding Experts 48ร—9 FP16 ~12M ~5.5 GB High-precision coding experts
โ”œโ”€ gate_proj 432 FP16 1.6M ~1.3 GB Gate projection (2048โ†’768)
โ”œโ”€ up_proj 432 FP16 1.6M ~1.3 GB Up projection (2048โ†’768)
โ””โ”€ down_proj 432 FP16 1.6M ~1.3 GB Down projection (768โ†’2048)
Non-coding Experts 48ร—119 Q4 ~12M ~10.8 GB Memory-efficient general experts
โ”œโ”€ gate_proj 5,712 Q4 1.6M ~2.7 GB Gate projection (2048โ†’768)
โ”œโ”€ up_proj 5,712 Q4 1.6M ~2.7 GB Up projection (2048โ†’768)
โ””โ”€ down_proj 5,712 Q4 1.6M ~2.7 GB Down projection (768โ†’2048)
Layer Norms 96 FP16 2K ~0.4 MB RMSNorm layers (input + post-attn)
LM Head 1 FP16 311M ~0.6 GB Final token prediction (2048โ†’151936)

Total Parameters: ~30B (3.7B active per forward pass) Total Model Size: 22.32 GB

Precision Breakdown

Precision Components Total Size Percentage
FP16 Embeddings, Attention (all), Router, LM Head, Coding Experts, Norms ~16.3 GB 73%
Q4 Non-coding Experts only ~10.8 GB 48%
Overhead Scales, biases for Q4 ~0.8 GB 4%

Note: Percentages add to >100% because Q4 overhead included separately

Expert Distribution

Coding Experts (9 experts): FP16 for maximum code quality

Expert IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126
Precision: FP16 (float16)
Size per expert: ~12.8 MB ร— 48 layers = ~614 MB
Total: 9 experts ร— 614 MB = ~5.5 GB

Non-coding Experts (119 experts): Q4 for memory efficiency

Expert IDs: 0-20, 22-30, 32-42, 44-58, 60-65, 67-70, 72-112, 114-127
Precision: Q4 (4-bit quantized with FP16 scales/biases)
Size per expert: ~1.9 MB ร— 48 layers = ~91 MB
Total: 119 experts ร— 91 MB = ~10.8 GB

Floating Point Formats

Format Bits Bytes Range Precision Compression Use Case
FP16 16 2 ยฑ65,504 ~3-4 digits 1ร— (baseline) Quality-critical components
Q4 4* 0.5* Dynamicโ€  ~2 digits 4ร— vs FP16 Non-critical components

*Plus FP16 scales/biases overhead (~3% of original size) โ€ Range determined by per-group scales

FP16 (Half Precision) Details:

  • Format: IEEE 754 binary16
  • Bit layout: 1 sign + 5 exponent + 10 mantissa
  • Range: ยฑ65,504 (subnormal: ยฑ6.10ร—10โปโต)
  • Precision: ~3.3 decimal digits (machine epsilon: 2โปยนโฐ โ‰ˆ 0.001)
  • Size: 2 bytes per parameter
  • Advantages:
    • Native GPU/NPU support
    • No quality loss vs FP32 for most ML tasks
    • Fast computation on Apple Silicon
  • Used for:
    • Attention layers (q/k/v/o projections)
    • Router gates
    • LM head
    • Coding experts (9)
    • All layer norms
    • Embeddings

Q4 (4-bit Quantized) Details:

  • Format: Grouped affine quantization
  • Method:
    quantized_value = round((value - bias) / scale)
    dequantized_value = quantized_value * scale + bias
    
  • Group size: 64 elements share one scale/bias pair
  • Storage:
    • Weights: 4 bits per element (packed into uint32)
    • Scales: FP16 (1 per 64 elements)
    • Biases: FP16 (1 per 64 elements)
  • Compression: ~4ร— vs FP16 (accounting for scales/biases overhead)
  • Quality impact: ~1-2% degradation on general tasks
  • Advantages:
    • 4ร— memory reduction
    • MLX has native Q4 kernels (gather_qmm)
    • Acceptable quality for non-coding experts
  • Used for:
    • Non-coding experts (119)
    • General knowledge tasks
    • Creative writing tasks
    • Non-technical content

Precision Selection Rationale

Component Chosen Precision Reason
Attention FP16 Long-range dependencies require precision
Router FP16 Accurate expert selection critical
LM Head FP16 Token probability distribution quality
Coding Experts FP16 Code syntax/structure needs precision
Non-Coding Experts Q4 General text tolerates quantization well
Layer Norms FP16 Normalization stability

Quality vs Size Tradeoff:

All FP16:  56 GB โ†’ Best quality, impractical for 32GB Mac
Hetero-v3: 22 GB โ†’ 95% of FP16 quality, fits in 32GB Mac โœ“
All Q4:    16 GB โ†’ 70% of FP16 quality, lowest memory

Intended Uses

Primary Use Cases

โœ… Code Generation

  • Writing Python, JavaScript, Java, C++, and other programming languages
  • Implementing algorithms and data structures
  • Code completion and refactoring
  • Debugging and code explanation

โœ… Technical Writing

  • API documentation
  • Technical tutorials
  • System design documents
  • Code comments and docstrings

โœ… General Text Generation

  • Question answering
  • Summarization
  • Creative writing
  • General conversation

Out-of-Scope Uses

โŒ Not Recommended For

  • Production systems without human oversight
  • Medical, legal, or financial advice
  • Real-time safety-critical applications
  • Generating harmful or misleading content

How to Use

Installation

# Install MLX and dependencies
pip install mlx mlx-lm transformers

# Or use uv (faster)
uv pip install mlx mlx-lm transformers

Quick Start

from qwen3_moe_hetero import load_hetero_v3
import mlx.core as mx

# Load model
print("Loading Qwen3-30B-MoE Hetero-v3...")
model, tokenizer = load_hetero_v3("./qwen3-30b-mlx-hetero-v3")

# Prepare prompt
prompt = "Write a Python function to compute fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="np")
input_ids = mx.array(inputs["input_ids"])

# Generate
cache = None
tokens = input_ids
max_tokens = 200

for i in range(max_tokens):
    # Forward pass
    logits, cache = model(
        tokens if cache is None else tokens[:, -1:],
        cache=cache
    )

    # Sample next token
    next_logits = logits[:, -1, :] / 1.0  # temperature
    probs = mx.softmax(next_logits, axis=-1)
    next_token = mx.random.categorical(mx.log(probs + 1e-10))

    # Append and evaluate
    next_token = mx.expand_dims(next_token, axis=0)
    tokens = mx.concatenate([tokens, next_token], axis=-1)
    mx.eval(tokens)

    # Check for EOS
    if next_token.item() == tokenizer.eos_token_id:
        break

# Decode output
output = tokenizer.decode(tokens[0].tolist(), skip_special_tokens=True)
print(output)

Using the CLI

# Using the provided MLX-compatible CLI
python mlx_lm_hetero_generate.py \
    --model ./qwen3-30b-mlx-hetero-v3 \
    --max-tokens 500 \
    --temp 1.0 \
    --prompt "Implement a thread-safe LRU cache in Python:"

CLI Options

--model          # Path to model directory (default: ./qwen3-30b-mlx-hetero-v3)
--prompt         # Input prompt (required)
--max-tokens     # Maximum tokens to generate (default: 100)
--temp           # Sampling temperature, 0=greedy (default: 0.7)
--top-p          # Top-p nucleus sampling (default: 0.9)
--verbose        # Show tokens as they're generated
--seed           # Random seed for reproducibility

Example Outputs

Coding Task:

Input: "Write a Python function to implement binary search:"

Output:
def binary_search(arr, target):
    """
    Perform binary search on a sorted array.

    Args:
        arr: Sorted list of comparable elements
        target: Element to search for

    Returns:
        Index of target if found, -1 otherwise
    """
    left, right = 0, len(arr) - 1

    while left <= right:
        mid = (left + right) // 2

        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return -1

Training Details

Base Model

  • Base: Qwen/Qwen3-30B-A14B-MoE
  • Architecture: 128-expert Mixture-of-Experts
  • Parameters: ~30B total, ~3.7B active per token
  • Context Length: 40,960 tokens

Conversion Process

This model was created through heterogeneous quantization:

  1. Source Models:

    • Qwen3-30B-MoE-Q4: Q4 quantized version (for non-coding experts)
    • Qwen3-30B-MoE-Hetero-v2: FP16 coding experts source
  2. Quantization Strategy:

    • FP16 (no quantization): Attention, router, lm_head, coding experts (9)
    • Q4 (4-bit quantization): Non-coding experts (119)
    • Group size: 64 (for Q4 quantization)
  3. Expert Selection:

    • Coding experts identified through profiling on coding tasks
    • Expert IDs: 21, 27, 31, 43, 59, 66, 71, 113, 126
  4. Weight Organization:

    • Consolidated into 4 standard safetensors files
    • Standard MLX model format
    • Compatible with MLX tooling

Hardware Requirements

Component Minimum Recommended
RAM 28 GB 32 GB+
Storage 25 GB 30 GB
Platform Apple Silicon M1+ M2/M3/M4 Pro/Max/Ultra

Note: This model uses unified memory on Apple Silicon. 32GB+ recommended for comfortable usage.


Evaluation

Benchmarks

Compared against Qwen3-30B-MoE variants:

Model Size Coding Quality General Quality Speed (tok/s)
Q4 Baseline 17.62 GB Baseline (0%) Baseline (0%) ~20
Hetero-v2 20.55 GB +20% Similar ~20
Hetero-v3 22.32 GB +25-30% +10-15% ~21-28

Quality Improvements

Coding Tasks (Fibonacci, LRU Cache, Binary Search, etc.)

  • +25-30% improvement over Q4
  • +5-10% improvement over Hetero-v2
  • Better code structure, fewer syntax errors
  • More idiomatic implementations

General Knowledge (History, Science, Explanations)

  • +10-15% improvement over Q4
  • Better paragraph coherence (FP16 attention)
  • More accurate expert selection (FP16 router)

Creative Writing (Stories, Poetry, Dialogue)

  • +10-15% improvement over Q4
  • More natural word choices (FP16 lm_head)
  • Better narrative flow (FP16 attention)

Performance Metrics

Metric Value
First Token Latency 4-6 seconds
Subsequent Tokens 21-28 tok/sec
Memory Usage 22.32 GB (model) + 3 GB (overhead)
Prompt Processing ~50 tok/sec

Tested on M2 Max 96GB, measurements may vary by hardware


Limitations

Known Issues

  1. MLX-LM Compatibility: Requires custom loader due to cache format differences

    • Standard mlx_lm.generate() not yet supported
    • Use provided mlx_lm_hetero_generate.py CLI instead
  2. Memory Requirements: Requires 32GB+ unified memory

    • Will not run on 16GB or 24GB systems
    • Consider Q4 variant for memory-constrained setups
  3. First Token Latency: 4-6 seconds for first token

    • Due to KV cache initialization
    • Subsequent tokens are much faster (21-28 tok/sec)

Bias and Safety

โš ๏ธ Important: This model inherits biases from the base Qwen3 model:

  • May reflect biases present in training data
  • Can generate harmful or misleading content
  • Should not be used without human oversight
  • Not suitable for high-stakes decision making

Recommended: Always review and validate model outputs, especially for:

  • Code (security vulnerabilities, bugs)
  • Factual claims (hallucinations possible)
  • Sensitive topics (bias, fairness issues)

Comparison with Other Variants

vs. Qwen3-30B-MoE Q4

Hetero-v3 Advantages:

  • โœ… +25-30% better coding quality
  • โœ… +10-15% better general quality
  • โœ… FP16 attention for better coherence
  • โœ… FP16 router for better expert selection
  • โœ… Standard 4-file format

Hetero-v3 Tradeoffs:

  • โš ๏ธ +4.7 GB larger (22.32 GB vs 17.62 GB)
  • โš ๏ธ Requires custom CLI (not mlx_lm.generate())

vs. Hetero-v2

Hetero-v3 Advantages:

  • โœ… FP16 attention (vs Q4)
  • โœ… FP16 router (vs Q4)
  • โœ… FP16 lm_head (vs Q4)
  • โœ… 4 files (vs 97 files!)
  • โœ… Standard MLX format
  • โœ… +5-10% better quality

Hetero-v3 Tradeoffs:

  • โš ๏ธ +1.77 GB larger (22.32 GB vs 20.55 GB)

Verdict: Hetero-v3 is recommended over v2 for the significant improvements with minimal size increase.


File Structure

qwen3-30b-mlx-hetero-v3/
โ”œโ”€โ”€ model-00001-of-00004.safetensors  # 5.6 GB - Embeddings, early layers
โ”œโ”€โ”€ model-00002-of-00004.safetensors  # 5.6 GB - Middle layers
โ”œโ”€โ”€ model-00003-of-00004.safetensors  # 5.6 GB - Late layers
โ”œโ”€โ”€ model-00004-of-00004.safetensors  # 5.6 GB - Final layers, lm_head
โ”œโ”€โ”€ config.json                        # Model configuration
โ”œโ”€โ”€ tokenizer.json                     # Tokenizer configuration
โ”œโ”€โ”€ tokenizer_config.json             # Tokenizer settings
โ””โ”€โ”€ qwen3_moe_hetero.py               # Model implementation (required)

Total: 22.32 GB (model weights only)


Technical Specifications

Model Architecture

Qwen3MoeForCausalLM (30B parameters, 3.7B active)
โ”‚
โ”œโ”€โ”€ Embedding Layer
โ”‚   โ””โ”€โ”€ embed_tokens: [vocab_size=151936, hidden_size=2048] FP16
โ”‚       Size: 311M params ร— 2 bytes = 622 MB
โ”‚
โ”œโ”€โ”€ 48 ร— Transformer Layers (Layer 0-47)
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ Input LayerNorm
โ”‚   โ”‚   โ””โ”€โ”€ weight: [2048] FP16
โ”‚   โ”‚       Size: 2K params ร— 2 bytes = 4 KB per layer
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ Multi-Head Attention (32 heads, 4 KV heads, GQA)
โ”‚   โ”‚   โ”œโ”€โ”€ q_proj: [2048 โ†’ 4096] FP16
โ”‚   โ”‚   โ”‚   Size: 8.4M params ร— 2 bytes = 16.8 MB per layer
โ”‚   โ”‚   โ”œโ”€โ”€ k_proj: [2048 โ†’ 512] FP16
โ”‚   โ”‚   โ”‚   Size: 1.0M params ร— 2 bytes = 2.1 MB per layer
โ”‚   โ”‚   โ”œโ”€โ”€ v_proj: [2048 โ†’ 512] FP16
โ”‚   โ”‚   โ”‚   Size: 1.0M params ร— 2 bytes = 2.1 MB per layer
โ”‚   โ”‚   โ”œโ”€โ”€ o_proj: [4096 โ†’ 2048] FP16
โ”‚   โ”‚   โ”‚   Size: 8.4M params ร— 2 bytes = 16.8 MB per layer
โ”‚   โ”‚   โ”œโ”€โ”€ q_norm: [head_dim=128] FP16 (RMSNorm)
โ”‚   โ”‚   โ”‚   Size: 128 params ร— 2 bytes = 256 bytes per layer
โ”‚   โ”‚   โ””โ”€โ”€ k_norm: [head_dim=128] FP16 (RMSNorm)
โ”‚   โ”‚       Size: 128 params ร— 2 bytes = 256 bytes per layer
โ”‚   โ”‚   Total Attention: ~37.8 MB per layer ร— 48 = 1.8 GB
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ Post-Attention LayerNorm
โ”‚   โ”‚   โ””โ”€โ”€ weight: [2048] FP16
โ”‚   โ”‚       Size: 2K params ร— 2 bytes = 4 KB per layer
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€ Sparse MoE Block (Top-8 of 128 experts)
โ”‚       โ”‚
โ”‚       โ”œโ”€โ”€ Router Gate
โ”‚       โ”‚   โ””โ”€โ”€ weight: [2048 โ†’ 128] FP16
โ”‚       โ”‚       Size: 262K params ร— 2 bytes = 524 KB per layer
โ”‚       โ”‚
โ”‚       โ”œโ”€โ”€ Coding Experts (9 experts: IDs 21,27,31,43,59,66,71,113,126)
โ”‚       โ”‚   โ”œโ”€โ”€ gate_proj: [9, 2048 โ†’ 768] FP16
โ”‚       โ”‚   โ”‚   Size: 9 ร— 1.6M params ร— 2 bytes = 28.8 MB per layer
โ”‚       โ”‚   โ”œโ”€โ”€ up_proj: [9, 2048 โ†’ 768] FP16
โ”‚       โ”‚   โ”‚   Size: 9 ร— 1.6M params ร— 2 bytes = 28.8 MB per layer
โ”‚       โ”‚   โ””โ”€โ”€ down_proj: [9, 768 โ†’ 2048] FP16
โ”‚       โ”‚       Size: 9 ร— 1.6M params ร— 2 bytes = 28.8 MB per layer
โ”‚       โ”‚   Total Coding Experts: 86.4 MB per layer ร— 48 = 4.1 GB
โ”‚       โ”‚
โ”‚       โ””โ”€โ”€ Non-Coding Experts (119 experts: remaining IDs)
โ”‚           โ”œโ”€โ”€ gate_proj: [119, 2048 โ†’ 768] Q4
โ”‚           โ”‚   Weight: 119 ร— 1.6M params ร— 0.5 bytes = 95.2 MB per layer
โ”‚           โ”‚   Scales: 119 ร— 25K groups ร— 2 bytes = 6.0 MB per layer
โ”‚           โ”‚   Biases: 119 ร— 25K groups ร— 2 bytes = 6.0 MB per layer
โ”‚           โ”‚   Total: 107.2 MB per layer
โ”‚           โ”œโ”€โ”€ up_proj: [119, 2048 โ†’ 768] Q4
โ”‚           โ”‚   Total: 107.2 MB per layer
โ”‚           โ””โ”€โ”€ down_proj: [119, 768 โ†’ 2048] Q4
โ”‚               Total: 107.2 MB per layer
โ”‚           Total Non-Coding Experts: 321.6 MB per layer ร— 48 = 15.4 GB
โ”‚
โ”œโ”€โ”€ Final LayerNorm
โ”‚   โ””โ”€โ”€ weight: [2048] FP16
โ”‚       Size: 2K params ร— 2 bytes = 4 KB
โ”‚
โ””โ”€โ”€ LM Head (Language Model Head)
    โ””โ”€โ”€ weight: [2048 โ†’ 151936] FP16
        Size: 311M params ร— 2 bytes = 622 MB

Total Model Size: 22.32 GB

Layer-by-Layer Breakdown

Each of the 48 transformer layers contains:

Component Shape Precision Size Cumulative
Input LayerNorm [2048] FP16 4 KB -
Attention q_proj [2048, 4096] FP16 16.8 MB 16.8 MB
Attention k_proj [2048, 512] FP16 2.1 MB 18.9 MB
Attention v_proj [2048, 512] FP16 2.1 MB 21.0 MB
Attention o_proj [4096, 2048] FP16 16.8 MB 37.8 MB
Attention q_norm [128] FP16 256 B 37.8 MB
Attention k_norm [128] FP16 256 B 37.8 MB
Post-Attn LayerNorm [2048] FP16 4 KB 37.8 MB
Router gate [2048, 128] FP16 524 KB 38.3 MB
Coding Experts (9ร—) 3 ร— [9,2048,768] FP16 86.4 MB 124.7 MB
Non-Coding Experts (119ร—) 3 ร— [119,2048,768] Q4 321.6 MB 446.3 MB
Total per layer - - ~446 MB -

Total for 48 layers: 446 MB ร— 48 = 21.4 GB Plus embeddings + LM head: 622 MB + 622 MB = 1.24 GB Grand total: 22.64 GB (slight overhead accounts for difference from 22.32 GB)

Precision Comparison Across Layers

Visual representation of what's FP16 vs Q4 in each layer:

Layer 0-47 (48 layers total):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Input LayerNorm                   [FP16]    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Attention Block:                            โ”‚
โ”‚  โ”œโ”€ q_proj (2048โ†’4096)           [FP16] โœ“  โ”‚
โ”‚  โ”œโ”€ k_proj (2048โ†’512)            [FP16] โœ“  โ”‚
โ”‚  โ”œโ”€ v_proj (2048โ†’512)            [FP16] โœ“  โ”‚
โ”‚  โ”œโ”€ o_proj (4096โ†’2048)           [FP16] โœ“  โ”‚
โ”‚  โ”œโ”€ q_norm (RMSNorm)             [FP16] โœ“  โ”‚
โ”‚  โ””โ”€ k_norm (RMSNorm)             [FP16] โœ“  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Post-Attention LayerNorm          [FP16]    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ MoE Block:                                  โ”‚
โ”‚  โ”œโ”€ Router gate (2048โ†’128)       [FP16] โœ“  โ”‚
โ”‚  โ”œโ”€ Coding Experts (9):                     โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 21 (all projs)   [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 27 (all projs)   [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 31 (all projs)   [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 43 (all projs)   [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 59 (all projs)   [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 66 (all projs)   [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 71 (all projs)   [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ”œโ”€ Expert 113 (all projs)  [FP16] โœ“  โ”‚
โ”‚  โ”‚   โ””โ”€ Expert 126 (all projs)  [FP16] โœ“  โ”‚
โ”‚  โ””โ”€ Non-Coding Experts (119):              โ”‚
โ”‚      โ””โ”€ Experts 0-127 (except coding)      โ”‚
โ”‚         โ””โ”€ All projections        [Q4]  โ—†  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Legend:
  [FP16] โœ“ = Full precision (float16) - 2 bytes per param
  [Q4]   โ—† = 4-bit quantized - 0.5 bytes per param + FP16 scales/biases

Memory Layout Per Layer

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚      Component          โ”‚   Size   โ”‚  Format  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Attention (all)         โ”‚  37.8 MB โ”‚  FP16    โ”‚
โ”‚ Layer Norms (2)         โ”‚   8 KB   โ”‚  FP16    โ”‚
โ”‚ Router                  โ”‚  524 KB  โ”‚  FP16    โ”‚
โ”‚ Coding Experts (9)      โ”‚  86.4 MB โ”‚  FP16    โ”‚
โ”‚ Non-Coding Experts(119) โ”‚ 321.6 MB โ”‚  Q4      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Total per layer         โ”‚ ~446 MB  โ”‚  Mixed   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

FP16 portion: ~125 MB per layer (28%)
Q4 portion:   ~321 MB per layer (72%)

Quantization Details

FP16 Components (11.5 GB):

  • Attention layers: 48 ร— 4 projections ร— ~50MB = ~9.6 GB
  • Router: 48 ร— 1MB = ~48 MB
  • LM Head: ~300 MB
  • Coding Experts: 9 ร— 3 projections ร— 48 layers ร— ~4MB = ~5.2 GB
  • Embeddings & Norms: ~600 MB

Q4 Components (10.8 GB):

  • Non-Coding Experts: 119 ร— 3 projections ร— 48 layers ร— ~1.5MB = ~10.8 GB
    • Stored as: quantized weights (uint32) + scales (FP16) + biases (FP16)
    • Group size: 64
    • Effective compression: ~4x vs FP16

Implementation Files

The model requires custom implementation files:

  1. qwen3_moe_hetero.py - Model definition

    • Model class (main model)
    • HeteroSwitchGLU (mixed-precision MoE)
    • load_hetero_v3() loader function
  2. mlx_lm_hetero_generate.py - CLI tool

    • MLX-compatible generation CLI
    • Same interface as mlx_lm generate

Download from: [GitHub Repository Link]


Citation

If you use this model, please cite:

@misc{qwen3-hetero-v3,
  title={Qwen3-30B-MoE Hetero-v3: Heterogeneous Quantization for Apple Silicon},
  author={[Your Name]},
  year={2024},
  howpublished={https://huggingface.co/[your-username]/qwen3-30b-mlx-hetero-v3}
}

And the base model:

@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}

License

This model is released under the Apache 2.0 License, same as the base Qwen3 model.

Terms:

  • โœ… Commercial use allowed
  • โœ… Modification allowed
  • โœ… Distribution allowed
  • โœ… Private use allowed
  • โš ๏ธ Must include license and copyright notice
  • โš ๏ธ Must state changes made

See Apache 2.0 License for full terms.


Acknowledgments

  • Qwen Team for the excellent base model
  • MLX Team at Apple for the MLX framework
  • Anthropic for Claude (used in development and documentation)

Contact & Support


Version History

v3.0 (Current)

  • FP16 attention layers (improved from Q4 in v2)
  • FP16 router (improved from Q4 in v2)
  • FP16 lm_head (improved from Q4 in v2)
  • Standard 4-file format (improved from 97 files in v2)
  • MLX-compatible CLI tool
  • +25-30% coding quality improvement over Q4

v2.0

  • FP16 coding experts (9)
  • Q4 non-coding experts (119)
  • Q4 attention/router/lm_head
  • 97-file custom format
  • +20% coding quality improvement over Q4

v1.0 (Q4 Baseline)

  • Full Q4 quantization
  • 49-file standard format
  • Standard mlx_lm compatible

Built with โค๏ธ for Apple Silicon
Optimized for M1/M2/M3/M4 chips using MLX

Downloads last month
45
MLX
Hardware compatibility
Log In to add your hardware

Quantized

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support