# Qwen3 MoE (Mixture of Experts), 39M Parameters

Custom Qwen3 MoE model trained with pipeline parallelism.

## Model Details

| Property | Value |
|---|---|
| Total Parameters | 39,388,928 |
| Architecture | MoE (Mixture of Experts) |
| Hidden Size | 128 |
| Num Layers | 2 |
| Attention Heads | 4 |
| Context Length | 512 |
| Vocab Size | 151,936 |
| Num Experts | 4 |
| Top-K Experts | 2 |
| MoE Hidden Dim | 128 |
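With 4 experts and top-2 routing, each token activates only half of the expert FFNs per MoE layer. The following is a minimal sketch of that routing pattern using the sizes from the table above; all variable and module names are illustrative, not taken from the actual model code:

```python
import torch
import torch.nn.functional as F

# Illustrative top-k expert routing with this card's settings
# (hidden size 128, 4 experts, top-2 selection).
hidden_size, num_experts, top_k = 128, 4, 2

torch.manual_seed(0)
x = torch.randn(3, hidden_size)  # a batch of 3 token embeddings
router = torch.nn.Linear(hidden_size, num_experts, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)
)

logits = router(x)                                # (3, num_experts) scores
weights, idx = torch.topk(logits, top_k, dim=-1)  # keep the 2 best experts
weights = F.softmax(weights, dim=-1)              # renormalize their scores

out = torch.zeros_like(x)
for t in range(x.size(0)):  # each token goes only to its own top-2 experts
    for slot in range(top_k):
        e = idx[t, slot].item()
        out[t] += weights[t, slot] * experts[e](x[t])
print(out.shape)  # torch.Size([3, 128])
```

Real MoE implementations batch tokens per expert instead of looping, but the routing logic is the same.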

## Evaluation Results

| Metric | Value |
|---|---|
| val_loss | 0.3217 |
| val_perplexity | 1.3795 |
| train_loss | 0.2418 |
| step | 2000 |
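The reported perplexity is simply the exponential of the validation cross-entropy loss, which is easy to verify:

```python
import math

# Perplexity is exp(cross-entropy loss), so the two metrics agree.
val_loss = 0.3217
val_perplexity = math.exp(val_loss)
print(round(val_perplexity, 4))  # 1.3795
```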

## Usage

```python
import torch
from safetensors.torch import load_file

# Load model weights
state_dict = load_file("model.safetensors")
```
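Once loaded, the state dict can be inspected before wiring it into the model class. A short sketch follows; the key name and tensor below are hypothetical stand-ins for the real checkpoint contents, whose names depend on the training code:

```python
import torch

# Hypothetical stand-in for load_file("model.safetensors"); real key
# names depend on the model implementation in the repo.
state_dict = {"embed_tokens.weight": torch.zeros(151936, 128)}

# Count parameters and list tensor shapes and dtypes.
total = sum(t.numel() for t in state_dict.values())
print(f"{total:,} parameters")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```

Note that for this model the 151,936-entry vocabulary embedding alone dominates the parameter count.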

## Training

Trained using pipeline parallelism with the multi_gpu_pretraining framework.
