Tiny Qwen3 Gated MoE 3M
This repository contains a tiny Qwen3 MoE causal language model for validation and debugging.
The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Qwen3-style routed expert and shared expert MoE implementation paths.
Repository name:
shibatch/tinyqwen3gatedmoe3m
Model purpose
This model is designed for:
Qwen3 MoE loader tests
Qwen3MoeForCausalLM inference tests
routed expert dispatch validation
top-k expert routing validation
shared expert path validation
small end-to-end generation checks
tokenizer save/load checks
model conversion tests
inference engine regression tests
It is not designed for:
high-quality story generation
chat use
instruction following
OCR
multimodal inference
production deployment
benchmark comparison with full-size Qwen3 models
Relation to Qwen3 MoE variants
There are two closely related MoE styles in the Qwen3 family.
The first is the standard Qwen3 MoE style. In this design, the dense FFN block is replaced by sparse routed experts. A router selects a small number of experts for each token, and only those selected experts are activated.
The second is the Qwen3-Next style. Qwen3-Next also uses sparse routed experts, but combines them with a shared expert path. In other words, the MoE block has both token-routed experts and a shared expert component.
This tiny model is intended to cover the second style at validation scale:
sparse routed experts + shared expert
Conceptually:
standard Qwen3 MoE:
    sparse routed experts
Qwen3-Next-style MoE:
    sparse routed experts + shared expert
this tiny model:
    sparse routed experts + shared expert, scaled down for validation
The purpose of this checkpoint is not to reproduce Qwen3-Next scale or quality. It is a compact checkpoint for testing routed-expert and shared-expert implementation paths.
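The difference between the two styles can be sketched in a few lines of plain Python. This is a toy illustration, not the Transformers implementation: the experts are stand-in scalar functions, the routing weights are passed in as a hypothetical router output, and mixing the shared expert through a sigmoid gate is an assumption about how the shared path is combined with the routed output.

```python
import math

def make_expert(scale):
    """Stand-in for an expert FFN: just scales its input."""
    return lambda x: scale * x

def moe_block(x, routed_experts, routing_weights,
              shared_expert=None, shared_gate_logit=0.0):
    # Only the experts selected by the router are evaluated.
    out = sum(w * routed_experts[i](x) for i, w in routing_weights.items())
    if shared_expert is not None:
        # Assumption: the shared expert sees every token and its output
        # is scaled by a sigmoid gate before being added.
        gate = 1.0 / (1.0 + math.exp(-shared_gate_logit))
        out += gate * shared_expert(x)
    return out

routed_experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
shared_expert = make_expert(10.0)
routing_weights = {1: 0.75, 3: 0.25}  # hypothetical top-2 router output

routed_only = moe_block(2.0, routed_experts, routing_weights)
with_shared = moe_block(2.0, routed_experts, routing_weights,
                        shared_expert=shared_expert)
print(routed_only, with_shared)  # 5.0 15.0
```

The routed-only call corresponds to the standard style; passing a shared expert adds a contribution that does not depend on the router's choice.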
Architecture
This checkpoint uses Hugging Face Qwen3MoeForCausalLM with a tiny Qwen3MoeConfig.
Representative configuration:
model_type: qwen3_moe
architectures:
- Qwen3MoeForCausalLM
vocab_size: 1024
hidden_size: 128
intermediate_size: 512
moe_intermediate_size: 256
shared_expert_intermediate_size: 128
num_hidden_layers: 6
num_attention_heads: 4
num_key_value_heads: 2
head_dim: 32
max_position_embeddings: 512
attention_bias: true
attention_dropout: 0.0
hidden_act: silu
rms_norm_eps: 1e-06
tie_word_embeddings: true
decoder_sparse_step: 1
mlp_only_layers: []
num_local_experts: 4
num_experts_per_tok: 2
norm_topk_prob: true
router_aux_loss_coef: 0.001
use_sliding_window: false
sliding_window: null
bos_token_id: 1000
eos_token_id: 1001
pad_token_id: 1000
The important MoE configuration is:
num_local_experts: 4
num_experts_per_tok: 2
shared_expert_intermediate_size: 128
moe_intermediate_size: 256
decoder_sparse_step: 1
This means each token is routed to two experts out of four local experts, while the shared expert path is also present.
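As a rough sketch of what top-2 routing with norm_topk_prob enabled does for one token, the snippet below applies a softmax over four router logits, keeps the two largest probabilities, and renormalizes them to sum to one. The function name route_token and the example logits are illustrative and not taken from the checkpoint.

```python
import math
from typing import List, Tuple

def route_token(router_logits: List[float], top_k: int = 2,
                norm_topk_prob: bool = True) -> List[Tuple[int, float]]:
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the top_k most probable experts, best first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in top]
    if norm_topk_prob:
        # Renormalize so the selected weights sum to 1.
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return list(zip(top, weights))

selected = route_token([2.0, 0.5, 1.0, -1.0])
for expert_index, weight in selected:
    print(f"expert {expert_index}: weight {weight:.4f}")
```

With these logits, experts 0 and 2 are selected. Setting norm_topk_prob to False keeps the raw softmax probabilities, so the two weights then sum to less than one.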
What this model covers
Compared with a dense Qwen3 model, this checkpoint exercises:
router logits
top-2 expert selection
expert routing weights
multiple local experts
shared expert path
MoE FFN parameters
GQA with num_key_value_heads = 2
tied word embeddings
Qwen3 RMSNorm behavior
Qwen3 MoE save/load behavior
Qwen3 MoE generate() behavior
This makes the checkpoint useful for small implementation tests where full-size Qwen3 MoE or Qwen3-Next checkpoints would be too large.
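For the GQA item, the head layout implied by the representative configuration can be worked out directly. The sketch below assumes the common grouping in which consecutive query heads share one key/value head; gqa_layout is a hypothetical helper name.

```python
def gqa_layout(num_attention_heads: int = 4,
               num_key_value_heads: int = 2,
               head_dim: int = 32) -> dict:
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    group = num_attention_heads // num_key_value_heads
    return {
        "q_proj_out": num_attention_heads * head_dim,
        "kv_proj_out": num_key_value_heads * head_dim,
        "queries_per_kv_head": group,
        # Assumed mapping: query head h reads key/value head h // group.
        "kv_head_for_query_head": [h // group for h in range(num_attention_heads)],
    }

layout = gqa_layout()
print(layout)
```

With 4 query heads, 2 key/value heads, and head_dim 32, the query projection is 128 wide, the key and value projections are 64 wide, and each key/value head serves two query heads.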
Parameter count
The reference checkpoint has approximately 3M parameters. The exact value should be read from the generated artifact_metadata.json or the training log.
The training script prints:
Parameter count: ...
MoE/router/expert/gate parameter names found: ...
Prefix breakdown:
...
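As a sanity check on the 3M figure, the count can be estimated from the representative configuration. The sketch below is a back-of-the-envelope estimate, not the training script's logic, and it rests on several assumptions about the layer layout: biases on the q/k/v projections only, per-head RMSNorm on queries and keys, three weight matrices (gate, up, down) per expert, a single-output gate on the shared expert, and tied input and output embeddings.

```python
def estimate_parameters(
    vocab_size=1024, hidden_size=128, num_hidden_layers=6,
    num_attention_heads=4, num_key_value_heads=2, head_dim=32,
    moe_intermediate_size=256, shared_expert_intermediate_size=128,
    num_local_experts=4,
) -> int:
    q_out = num_attention_heads * head_dim
    kv_out = num_key_value_heads * head_dim
    attention = (
        hidden_size * q_out + q_out            # q_proj weight + bias
        + 2 * (hidden_size * kv_out + kv_out)  # k_proj and v_proj weight + bias
        + q_out * hidden_size                  # o_proj weight (assumed no bias)
        + 2 * head_dim                         # q/k RMSNorm weights (assumed)
    )
    layer_norms = 2 * hidden_size              # pre-attention and pre-MoE norms
    router = hidden_size * num_local_experts
    routed = num_local_experts * 3 * hidden_size * moe_intermediate_size
    shared = 3 * hidden_size * shared_expert_intermediate_size
    shared_gate = hidden_size                  # assumed single-output gate
    per_layer = attention + layer_norms + router + routed + shared + shared_gate
    embeddings = vocab_size * hidden_size      # counted once (tied embeddings)
    final_norm = hidden_size
    return embeddings + num_hidden_layers * per_layer + final_norm

total = estimate_parameters()
print(f"estimated parameters: {total:,}")
```

Under these assumptions the estimate comes out at about 3.09M parameters, which is consistent with the 3M description. The authoritative number remains the one in artifact_metadata.json or the training log.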
Tokenizer
This model uses a custom byte-level BPE tokenizer.
The tokenizer is intentionally aligned with an earlier tiny Qwen3 MoE training recipe:
base_vocab_size: 1000
special tokens added after BPE training:
    <s>
    </s>
    <|im_start|>
expected ids:
    <s>: 1000
    </s>: 1001
    <|im_start|>: 1002
The model uses:
bos_token: <s>
eos_token: </s>
pad_token: <s>
This tokenizer layout is part of the validation artifact. It should not be replaced with a different tokenizer if exact checkpoint behavior is important.
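One way to guard this layout in a test is to compare a token-to-id mapping against the expected IDs. The sketch below is a minimal example with hypothetical helper names; it works on any plain dict, so it does not depend on a particular tokenizer class.

```python
from typing import Dict, List

BASE_VOCAB_SIZE = 1000
SPECIAL_TOKENS = ["<s>", "</s>", "<|im_start|>"]

def expected_special_ids(base_vocab_size: int = BASE_VOCAB_SIZE,
                         specials: List[str] = SPECIAL_TOKENS) -> Dict[str, int]:
    # Special tokens are appended after the BPE vocabulary, in order.
    return {tok: base_vocab_size + i for i, tok in enumerate(specials)}

def layout_mismatches(vocab: Dict[str, int]) -> List[str]:
    """Return one message per special token whose id differs from the layout."""
    problems = []
    for tok, want in expected_special_ids().items():
        got = vocab.get(tok)
        if got != want:
            problems.append(f"{tok}: expected {want}, got {got}")
    return problems

print(expected_special_ids())
```

With a loaded Transformers tokenizer, the mapping can be obtained from tokenizer.get_vocab() and passed to layout_mismatches; an empty list means the special tokens sit at the expected IDs.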
Training setup
Representative training settings:
dataset: TinyStories-style English text
num_epochs: 1
learning_rate: 1e-3
batch_size: 32
block_size: 256
dtype: bf16 or auto mixed precision depending on hardware
legacy tokenizer: enabled
frozen random q/k/v bias: enabled
qkv_bias_init_range: 0.2
The frozen q/k/v bias initialization is intentionally inherited from the earlier tiny Qwen3 MoE training recipe. It is unusual for a normal model, but useful here because this checkpoint is a tiny validation artifact rather than a production model.
Reference result
The reference run completed one epoch of TinyStories-style training.
Final loss: 1.657723343372345
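For orientation, the loss can be converted to a perplexity. This assumes the reported value is a mean per-token cross-entropy in nats; if the logged loss also includes the router auxiliary term, the figure is only approximate.

```python
import math

final_loss = 1.657723343372345  # value reported above
perplexity = math.exp(final_loss)
print(f"perplexity: {perplexity:.2f}")  # perplexity: 5.25
```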
Example generations from the reference checkpoint:
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she went to the park to play with her friends. She saw a big, scary dog and wanted to play too.
Lily's friend, Timmy, came over to play and saw the dog. He said, "Wow, that's a nice dog! Can I play with you?"
Lily said, "Sure, but be careful
Prompt: There was a little
There was a little girl named Lily. She was very excited to go to the park and play with her friends. She was so excited to go on the swings and slide down the slide.
As she was playing, she saw a big slide and wanted to go on it. She ran over to it and started to climb. But then, she saw a big slide and she was so excited. She ran over to it and started to climb.
But
Prompt: One day
One day, Lily's mom asked her to help clean the house. Lily was happy to help and started to clean the house. She put the paper on the floor and put it in the dirt.
After a while, Lily's mom came home and saw the paper. She was very happy and said, "Lily, you are so kind and helpful. You are a good helper." Lily smiled and said, "Thank you, mom
The model can generate TinyStories-like text fragments, but shallow story structure, repetition, and occasional semantic errors are expected. This is normal for this checkpoint and is not considered a failure for its intended validation purpose.
Usage
If the model files are stored under an hf/ subdirectory, use:
import torch
from transformers import PreTrainedTokenizerFast, Qwen3MoeForCausalLM
repo = "shibatch/tinyqwen3gatedmoe3m"
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Qwen3MoeForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()
prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Loading requirements
This checkpoint requires a Transformers version that supports Qwen3 MoE.
The following imports should work:
from transformers import Qwen3MoeConfig, Qwen3MoeForCausalLM
If these imports fail, update Transformers to a version with Qwen3 MoE support.
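A test suite can perform this check up front and skip cleanly instead of failing on import. The helper below is a minimal sketch with a hypothetical name; note that finding the class names does not guarantee a working backend such as PyTorch is installed.

```python
import importlib

def has_qwen3_moe_support() -> bool:
    """Return True if the installed Transformers exposes the Qwen3 MoE classes."""
    try:
        transformers = importlib.import_module("transformers")
        return all(
            hasattr(transformers, name)
            for name in ("Qwen3MoeConfig", "Qwen3MoeForCausalLM")
        )
    except Exception:
        # Transformers is missing, or a lazy import of the model module failed.
        return False

if not has_qwen3_moe_support():
    print("Qwen3 MoE classes not found; try: pip install --upgrade transformers")
```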
Tokenizer note
This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.
For this reason, examples use:
from transformers import PreTrainedTokenizerFast
instead of relying on AutoTokenizer.
Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.
Expected tokenizer files include:
tokenizer.json
tokenizer_config.json
special_tokens_map.json
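When working from a local copy of the repository, the presence of these files can be checked before loading. This is a small sketch using only the standard library; the function name and the example directory path are hypothetical.

```python
from pathlib import Path
from typing import List

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)

def missing_tokenizer_files(directory: str) -> List[str]:
    """List the expected tokenizer files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES
            if not (root / name).is_file()]

# Hypothetical local path; adjust to wherever the hf/ folder was downloaded.
print(missing_tokenizer_files("./tinyqwen3gatedmoe3m/hf"))
```

An empty list means all three files are present in the directory.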
Limitations
This is a tiny debug model. It should not be used as a general-purpose language model.
Known limitations:
small vocabulary
weak long-form coherence
shallow TinyStories-style structure
occasional repetition
occasional semantic errors
no instruction tuning
no chat formatting
no multimodal capability
no OCR capability
not suitable for production use
The checkpoint is primarily intended to make Qwen3 routed-expert and shared-expert MoE paths easy to test without downloading a large model.
Why this model exists
Full-size Qwen3 MoE and Qwen3-Next models are too large for many unit tests, loader tests, and small inference-engine regression tests.
This checkpoint provides a tiny model that still contains the important MoE features:
routed experts
top-2 expert selection
shared expert path
GQA
Qwen3 MoE config
Qwen3MoeForCausalLM load/generate/save path
It is intended to be small enough for quick testing while still being structurally meaningful.
Citation
This is a synthetic tiny validation checkpoint derived from Qwen3 MoE-compatible architecture settings. It is intended for debugging and implementation testing.