Tiny Qwen3 Gated MoE 3M

This repository contains a tiny Qwen3 MoE causal language model for validation and debugging.

The model is intentionally small and is not intended for high-quality text generation. Its main purpose is to provide a compact checkpoint that exercises Qwen3-style routed-expert and shared-expert MoE implementation paths.

Repository name:

shibatch/tinyqwen3gatedmoe3m

Model purpose

This model is designed for:

Qwen3 MoE loader tests
Qwen3MoeForCausalLM inference tests
routed expert dispatch validation
top-k expert routing validation
shared expert path validation
small end-to-end generation checks
tokenizer save/load checks
model conversion tests
inference engine regression tests

It is not designed for:

high-quality story generation
chat use
instruction following
OCR
multimodal inference
production deployment
benchmark comparison with full-size Qwen3 models

Relation to Qwen3 MoE variants

There are two closely related MoE styles in the Qwen3 family.

The first is the standard Qwen3 MoE style. In this design, the dense FFN block is replaced by sparse routed experts. A router selects a small number of experts for each token, and only those selected experts are activated.

The second is the Qwen3-Next style. Qwen3-Next also uses sparse routed experts, but combines them with a shared expert path. In other words, the MoE block has both token-routed experts and a shared expert component.

This tiny model is intended to cover the second style at validation scale:

sparse routed experts + shared expert

Conceptually:

standard Qwen3 MoE:
  sparse routed experts

Qwen3-Next-style MoE:
  sparse routed experts + shared expert

this tiny model:
  sparse routed experts + shared expert, scaled down for validation

The purpose of this checkpoint is not to reproduce Qwen3-Next scale or quality. It is a compact checkpoint for testing routed-expert and shared-expert implementation paths.

Architecture

This checkpoint uses Hugging Face Qwen3MoeForCausalLM with a tiny Qwen3MoeConfig.

Representative configuration:

model_type: qwen3_moe
architectures:
  - Qwen3MoeForCausalLM

vocab_size: 1024
hidden_size: 128
intermediate_size: 512
moe_intermediate_size: 256
shared_expert_intermediate_size: 128

num_hidden_layers: 6
num_attention_heads: 4
num_key_value_heads: 2
head_dim: 32

max_position_embeddings: 512
attention_bias: true
attention_dropout: 0.0
hidden_act: silu
rms_norm_eps: 1e-06
tie_word_embeddings: true

decoder_sparse_step: 1
mlp_only_layers: []

num_local_experts: 4
num_experts_per_tok: 2
norm_topk_prob: true
router_aux_loss_coef: 0.001

use_sliding_window: false
sliding_window: null

bos_token_id: 1000
eos_token_id: 1001
pad_token_id: 1000

The important MoE configuration is:

num_local_experts: 4
num_experts_per_tok: 2
shared_expert_intermediate_size: 128
moe_intermediate_size: 256
decoder_sparse_step: 1

This means each token is routed to two experts out of four local experts, while the shared expert path is also present.
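The routing described above can be illustrated with a minimal pure-Python sketch. It is not the Transformers implementation: the function names `route_token` and `moe_output`, the scalar toy experts, and the plain sum of routed and shared outputs are assumptions made for illustration. Real implementations operate on tensors and may also scale the shared path with a learned gate, which is omitted here.

```python
import math

def route_token(router_logits, top_k=2, norm_topk_prob=True):
    """Pick top_k experts for one token and return (indices, weights)."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    weights = [probs[i] for i in chosen]
    # With norm_topk_prob, the kept weights are rescaled to sum to 1.
    if norm_topk_prob:
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return chosen, weights

def moe_output(x, router_logits, experts, shared_expert):
    """Weighted sum of the selected routed experts plus the shared expert."""
    chosen, weights = route_token(router_logits)
    routed = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    # Assumption: the shared path is added ungated.
    return routed + shared_expert(x)

# Toy setup: 4 routed experts and 1 shared expert, each a scalar function.
toy_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
toy_shared = lambda x: 10.0 * x

indices, weights = route_token([0.1, 2.0, -1.0, 1.5])
output = moe_output(1.0, [0.1, 2.0, -1.0, 1.5], toy_experts, toy_shared)
print(indices, weights, output)
```

In this toy run, experts 1 and 3 are selected because they have the two largest router logits, and only those two expert functions are evaluated for the token.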

What this model covers

Compared with a dense Qwen3 model, this checkpoint exercises:

router logits
top-2 expert selection
expert routing weights
multiple local experts
shared expert path
MoE FFN parameters
GQA with num_key_value_heads = 2
tied word embeddings
Qwen3 RMSNorm behavior
Qwen3 MoE save/load behavior
Qwen3 MoE generate() behavior

This makes the checkpoint useful for small implementation tests where full-size Qwen3 MoE or Qwen3-Next checkpoints would be too large.
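For the GQA item in the list above, the following sketch shows how 4 query heads can share 2 key/value heads. The helper name `kv_head_for_query_head` is hypothetical, and the contiguous grouping (consecutive query heads share one key/value head) is an assumption based on the common convention; an inference engine under test should be checked against the reference implementation rather than this sketch.

```python
NUM_ATTENTION_HEADS = 4   # from the representative configuration
NUM_KEY_VALUE_HEADS = 2   # from the representative configuration

def kv_head_for_query_head(q_head, num_q=NUM_ATTENTION_HEADS, num_kv=NUM_KEY_VALUE_HEADS):
    """Return the key/value head index used by a given query head."""
    if num_q % num_kv != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    group_size = num_q // num_kv  # query heads per key/value head
    return q_head // group_size

mapping = [kv_head_for_query_head(q) for q in range(NUM_ATTENTION_HEADS)]
print(mapping)  # [0, 0, 1, 1]
```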

Parameter count

The reference checkpoint has approximately 3M parameters. Check the exact value against the generated artifact_metadata.json or the training log.

The training script prints:

Parameter count: ...
MoE/router/expert/gate parameter names found: ...
Prefix breakdown:
  ...
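As a rough cross-check of the 3M figure, the weight matrices can be tallied from the representative configuration. This is a back-of-the-envelope sketch, not the training script's counter: it assumes three projection matrices per expert (gate, up, down), assumes the shared expert is instantiated as configured, counts the tied embedding once, and ignores biases, norm weights, and any shared-expert gate. The function name `estimate_weight_params` is hypothetical.

```python
# Values copied from the representative configuration.
cfg = {
    "vocab_size": 1024,
    "hidden_size": 128,
    "moe_intermediate_size": 256,
    "shared_expert_intermediate_size": 128,
    "num_hidden_layers": 6,
    "num_attention_heads": 4,
    "num_key_value_heads": 2,
    "head_dim": 32,
    "num_local_experts": 4,
}

def estimate_weight_params(c):
    """Estimate the number of weight-matrix parameters (biases and norms excluded)."""
    h = c["hidden_size"]
    q_dim = c["num_attention_heads"] * c["head_dim"]
    kv_dim = c["num_key_value_heads"] * c["head_dim"]
    # q_proj + k_proj + v_proj + o_proj
    attention = h * q_dim + 2 * h * kv_dim + q_dim * h
    # Router: one logit per expert.
    router = h * c["num_local_experts"]
    # Assumption: each expert has gate, up, and down projections.
    routed = c["num_local_experts"] * 3 * h * c["moe_intermediate_size"]
    shared = 3 * h * c["shared_expert_intermediate_size"]
    per_layer = attention + router + routed + shared
    # Tied input/output embeddings are counted once.
    embeddings = c["vocab_size"] * h
    return embeddings + c["num_hidden_layers"] * per_layer

total = estimate_weight_params(cfg)
print(f"estimated weight parameters: {total:,}")
```

Under these assumptions the estimate lands slightly above 3M, with the routed experts accounting for most of it. The printed Parameter count from the training script remains the authoritative value.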

Tokenizer

This model uses a custom byte-level BPE tokenizer.

The tokenizer is intentionally aligned with an earlier tiny Qwen3 MoE training recipe:

base_vocab_size: 1000
special tokens added after BPE training:
  <s>
  </s>
  <|im_start|>

expected ids:
  <s>: 1000
  </s>: 1001
  <|im_start|>: 1002

The model uses:

bos_token: <s>
eos_token: </s>
pad_token: <s>

This tokenizer layout is part of the validation artifact. It should not be replaced with a different tokenizer if exact checkpoint behavior is important.
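A loader or conversion test may want to confirm that the special tokens kept their expected ids. The sketch below is one way to do that; the helper name `find_id_mismatches` and the stand-in vocabulary are hypothetical, and the helper accepts any callable that maps a token string to an id.

```python
BASE_VOCAB_SIZE = 1000
MODEL_VOCAB_SIZE = 1024
SPECIAL_TOKENS = ["<s>", "</s>", "<|im_start|>"]

# Special tokens are appended after the base BPE vocabulary, in order.
EXPECTED_IDS = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(SPECIAL_TOKENS)}

def find_id_mismatches(token_to_id):
    """Return {token: (expected, actual)} for every special token with a wrong id."""
    mismatches = {}
    for token, expected in EXPECTED_IDS.items():
        actual = token_to_id(token)
        if actual != expected:
            mismatches[token] = (expected, actual)
    return mismatches

# Stand-in lookup used only to demonstrate the helper.
fake_vocab = {"<s>": 1000, "</s>": 1001, "<|im_start|>": 1002}
mismatches = find_id_mismatches(fake_vocab.get)
print(mismatches)  # {}

# All special ids must fit inside the model's embedding table.
assert max(EXPECTED_IDS.values()) < MODEL_VOCAB_SIZE
```

With a loaded tokenizer, `tokenizer.convert_tokens_to_ids` can be passed in place of `fake_vocab.get`.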

Training setup

Representative training settings:

dataset: TinyStories-style English text
num_epochs: 1
learning_rate: 1e-3
batch_size: 32
block_size: 256
dtype: bf16 or auto mixed precision depending on hardware

legacy tokenizer: enabled
frozen random q/k/v bias: enabled
qkv_bias_init_range: 0.2

The frozen q/k/v bias initialization is intentionally inherited from the earlier tiny Qwen3 MoE training recipe. It is unusual for a normal model, but useful here because this checkpoint is a tiny validation artifact rather than a production model.
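The training script itself is not shown here, so the following is only a sketch of what a frozen random bias could look like in PyTorch. It assumes a symmetric uniform initialization in [-0.2, 0.2] and a plain `nn.Linear` projection; the helper name `make_frozen_bias_projection` and the seed are hypothetical, and the actual recipe may differ.

```python
import torch
from torch import nn

HIDDEN_SIZE = 128
QKV_BIAS_INIT_RANGE = 0.2  # qkv_bias_init_range from the training settings

def make_frozen_bias_projection(in_features, out_features,
                                init_range=QKV_BIAS_INIT_RANGE, seed=0):
    """Linear layer whose bias is randomly initialized and then frozen."""
    layer = nn.Linear(in_features, out_features, bias=True)
    generator = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        # Assumption: symmetric uniform range around zero.
        layer.bias.uniform_(-init_range, init_range, generator=generator)
    # Freeze only the bias; the weight matrix stays trainable.
    layer.bias.requires_grad_(False)
    return layer

# Example: a query projection with 4 heads of size 32.
q_proj = make_frozen_bias_projection(HIDDEN_SIZE, 4 * 32)
trainable = [name for name, p in q_proj.named_parameters() if p.requires_grad]
print(trainable)  # ['weight']
```

A non-zero frozen bias means an engine that silently drops the q/k/v bias terms produces visibly different outputs, which is useful for validation.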

Reference result

The reference run completed one epoch of TinyStories-style training.

Final loss: 1.657723343372345
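For intuition, the final loss can be converted to a perplexity. This assumes the reported value is a mean per-token cross-entropy in nats and does not include a significant auxiliary router-loss term; neither assumption is confirmed by the training log excerpt.

```python
import math

final_loss = 1.657723343372345  # reported final loss

# Perplexity is the exponential of the mean per-token cross-entropy (in nats).
perplexity = math.exp(final_loss)
print(f"perplexity: {perplexity:.2f}")
```

Under those assumptions the perplexity is a little above 5 on TinyStories-style text with a vocabulary of about 1,000 tokens.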

Example generations from the reference checkpoint:

Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she went to the park to play with her friends. She saw a big, scary dog and wanted to play too.

Lily's friend, Timmy, came over to play and saw the dog. He said, "Wow, that's a nice dog! Can I play with you?"

Lily said, "Sure, but be careful

Prompt: There was a little

There was a little girl named Lily. She was very excited to go to the park and play with her friends. She was so excited to go on the swings and slide down the slide.

As she was playing, she saw a big slide and wanted to go on it. She ran over to it and started to climb. But then, she saw a big slide and she was so excited. She ran over to it and started to climb.

But

Prompt: One day

One day, Lily's mom asked her to help clean the house. Lily was happy to help and started to clean the house. She put the paper on the floor and put it in the dirt.

After a while, Lily's mom came home and saw the paper. She was very happy and said, "Lily, you are so kind and helpful. You are a good helper." Lily smiled and said, "Thank you, mom

The model can generate TinyStories-like text fragments, but shallow story structure, repetition, and occasional semantic errors are expected. This is normal for this checkpoint and is not considered a failure for its intended validation purpose.

Usage

If the model files are stored under an hf/ subdirectory, use:

import torch
from transformers import PreTrainedTokenizerFast, Qwen3MoeForCausalLM

repo = "shibatch/tinyqwen3gatedmoe3m"

tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Qwen3MoeForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Loading requirements

This checkpoint requires a Transformers version that supports Qwen3 MoE.

The following imports should work:

from transformers import Qwen3MoeConfig, Qwen3MoeForCausalLM

If these imports fail, update Transformers to a version with Qwen3 MoE support.
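A test suite can make this check explicit and skip cleanly when support is missing. The sketch below is one possible guard; the function name `has_qwen3_moe` is hypothetical, and it only checks that the two class names are exposed by the installed package.

```python
import importlib.util

def has_qwen3_moe():
    """Return True if the installed Transformers exposes the Qwen3 MoE classes."""
    if importlib.util.find_spec("transformers") is None:
        return False
    try:
        import transformers
        # Attribute access triggers the lazy import of the model module.
        getattr(transformers, "Qwen3MoeConfig")
        getattr(transformers, "Qwen3MoeForCausalLM")
    except Exception:
        # Missing attribute or a failed lazy import both mean "not usable".
        return False
    return True

supported = has_qwen3_moe()
print("Qwen3 MoE support:", supported)
```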

Tokenizer note

This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.

For this reason, examples use:

from transformers import PreTrainedTokenizerFast

instead of relying on AutoTokenizer.

Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.

Expected tokenizer files include:

tokenizer.json
tokenizer_config.json
special_tokens_map.json
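Before loading, a test can verify that a local copy of the tokenizer directory contains the files listed above. This is a minimal sketch; the helper name `missing_tokenizer_files` is hypothetical, and the temporary directory only demonstrates the behavior.

```python
import tempfile
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)

def missing_tokenizer_files(directory):
    """Return the expected tokenizer files that are absent from directory."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES if not (root / name).is_file()]

# Demonstration with a temporary directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    missing = missing_tokenizer_files(tmp)

print(missing)  # ['tokenizer_config.json', 'special_tokens_map.json']
```

For a local download that keeps the files under an hf/ subdirectory, the call would be `missing_tokenizer_files("hf")`.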

Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

small vocabulary
weak long-form coherence
shallow TinyStories-style structure
occasional repetition
occasional semantic errors
no instruction tuning
no chat formatting
no multimodal capability
no OCR capability
not suitable for production use

The checkpoint is primarily intended to make Qwen3 routed-expert and shared-expert MoE paths easy to test without downloading a large model.

Why this model exists

Full-size Qwen3 MoE and Qwen3-Next models are too large for many unit tests, loader tests, and small inference-engine regression tests.

This checkpoint provides a tiny model that still contains the important MoE features:

routed experts
top-2 expert selection
shared expert path
GQA
Qwen3 MoE config
Qwen3MoeForCausalLM load/generate/save path

It is intended to be small enough for quick testing while still being structurally meaningful.

Citation

This is a synthetic tiny validation checkpoint derived from Qwen3 MoE-compatible architecture settings. It is intended for debugging and implementation testing.
