Tiny Gemma4 MoE Text
This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.
The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.
This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.
Model purpose
This model is designed for:
- testing Gemma4ForCausalLM
- validating Gemma4TextConfig
- testing Gemma4 text-only MoE model loading
- checking model save/load behavior
- checking tokenizer save/load behavior
- exercising sliding attention layers
- exercising full attention layers
- exercising grouped-query attention
- exercising Gemma4 per-layer input embedding paths
- exercising MoE expert parameters
- exercising top-k expert routing
- providing a compact Gemma4 MoE checkpoint for inference-engine validation
It is not designed for:
- high-quality story generation
- instruction following
- chat use
- OCR
- multimodal inference
- benchmark comparison against production language models
- production deployment
Model architecture
The model uses Gemma4ForCausalLM with a small Gemma4 text MoE configuration.
Representative configuration:
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 320
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
- sliding_attention
- sliding_attention
- full_attention
- sliding_attention
- sliding_attention
- full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
use_double_wide_mlp: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
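The attention values above imply the projection widths shown below. This is an arithmetic sketch only. The variable names are illustrative, and it assumes the usual grouped-query layout in which query heads are shared evenly across key/value heads.

```python
# Values copied from the representative configuration above.
hidden_size = 160
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32

# Assumption: standard grouped-query attention, where each key/value head
# is shared by an equal number of query heads.
q_width = num_attention_heads * head_dim     # query projection output width
kv_width = num_key_value_heads * head_dim    # key (or value) projection output width
queries_per_kv = num_attention_heads // num_key_value_heads

print(q_width, kv_width, queries_per_kv)  # 160 32 5
```

With a single key/value head, all five query heads read the same keys and values, which is the multi-query extreme of grouped-query attention.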
The attention pattern is:
ssFssF
where s means sliding_attention and F means full_attention.
This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
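As a small illustration, the compact pattern can be expanded into the layer_types list shown in the configuration. The helper name expand_layer_pattern is hypothetical and not part of Transformers.

```python
def expand_layer_pattern(pattern: str) -> list[str]:
    """Map a compact pattern such as 'ssFssF' to a list of layer type names."""
    names = {"s": "sliding_attention", "F": "full_attention"}
    return [names[ch] for ch in pattern]

layer_types = expand_layer_pattern("ssFssF")
for index, layer_type in enumerate(layer_types):
    print(index, layer_type)
```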
MoE configuration
This model enables Gemma4 MoE blocks.
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 320
The num_experts=4 and top_k_experts=2 setting is intentional. A smaller configuration such as num_experts=2, top_k=1 would exercise only a much simpler routing path. This checkpoint is intended to cover:
router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
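To make top-2 selection and weighted combination concrete, here is a generic top-k routing sketch in plain Python for a single token. It is not taken from the Gemma4 implementation. The function name, the softmax-then-select order, and the renormalization of the selected weights are all assumptions made for illustration.

```python
import math

def route_top_k(router_logits, expert_outputs, top_k=2):
    """Combine the outputs of the top_k experts for one token.

    router_logits: one score per expert.
    expert_outputs: one output vector per expert (all the same length).
    """
    # Softmax over all experts (shifted by the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k experts and renormalize their weights to sum to 1
    # (assumed convention, see the note above).
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    weights = {i: probs[i] / norm for i in chosen}

    # Weighted sum of the selected experts' outputs.
    dim = len(expert_outputs[0])
    combined = [sum(weights[i] * expert_outputs[i][d] for i in chosen) for d in range(dim)]
    return chosen, weights, combined

logits = [0.1, 2.0, -1.0, 1.0]  # one score per expert, 4 experts
outputs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
chosen, weights, combined = route_top_k(logits, outputs)
print(chosen)  # [1, 3]
```

With top_k=1 the weighted sum collapses to a single expert's output, which is why a top-1 configuration exercises a simpler path.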
Training data
The model was trained on TinyStories-style English story text.
The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.
Training setup
Representative training settings:
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 160
intermediate_size: 640
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
The final evaluation loss in the reference run was approximately:
Final loss: 2.4662
This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.
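For scale only, the loss can be converted to a perplexity. This sketch assumes the reported value is a mean per-token cross-entropy in nats, which the reference run does not state explicitly.

```python
import math

final_loss = 2.4662  # value reported above

# Assumption: the loss is mean per-token cross-entropy in nats.
perplexity = math.exp(final_loss)

# Perplexity of a uniform guess over the 1024-token vocabulary, for scale.
uniform_perplexity = 1024.0

print(f"perplexity ~ {perplexity:.2f} (uniform baseline: {uniform_perplexity:.0f})")
```

Under that assumption the perplexity is roughly 11.8 per token. Because the tokenizer has only 1024 entries, this number is not comparable with models that use larger vocabularies.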
Example generation
Example output from the reference checkpoint:
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house."
Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.
Prompt: There was a little
There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys."
Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys."
Lily
Prompt: One day
One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it.
"Hello, little girl!" said her mom. "I want to play with you!"
The little
The model can generate TinyStories-like text fragments, but repetition and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.
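If it is useful to quantify the repetition when comparing outputs across inference engines, a rough measure is the fraction of word n-grams that have already appeared earlier in the text. The function below is a hypothetical helper for illustration. It was not used to evaluate this checkpoint.

```python
def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that already appeared earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    seen = set()
    repeats = 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(ngrams)

# Fragment taken from the second example output above.
sample = "It's a big toys. It's a big toys. It's a big toys."
fraction = repeated_ngram_fraction(sample)
print(f"{fraction:.2f}")
```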
Usage
If the model files are stored under an hf/ subdirectory, use the following example.
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe5m"

tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Loading requirements
This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.
The following imports should work:
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
If these imports fail, update Transformers to a version with Gemma4 support.
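One way to test for this without crashing a script is a guarded check such as the sketch below. The function name has_gemma4_support is hypothetical. The check only confirms that the two class names are exposed by the installed package, not that a backend such as PyTorch is available.

```python
import importlib
import importlib.util

def has_gemma4_support() -> bool:
    """Return True if the installed transformers exposes the Gemma4 text classes."""
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    try:
        module = importlib.import_module("transformers")
        # Top-level names are resolved lazily, so attribute access may raise.
        return all(
            hasattr(module, name)
            for name in ("Gemma4ForCausalLM", "Gemma4TextConfig")
        )
    except Exception:
        return False

supported = has_gemma4_support()
print("Gemma4 support available:", supported)
```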
Tokenizer note
This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.
For this reason, examples use:
from transformers import PreTrainedTokenizerFast
instead of AutoTokenizer.
Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.
The expected tokenizer files include:
tokenizer.json
tokenizer_config.json
special_tokens_map.json
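A quick way to confirm that a local copy contains these files is sketched below. The helper name missing_tokenizer_files is hypothetical, and the temporary directory only stands in for a real download location.

```python
import tempfile
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)

def missing_tokenizer_files(directory) -> list[str]:
    """Return the expected tokenizer files that are absent from directory."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES if not (root / name).is_file()]

# Demonstration with a temporary directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    missing = missing_tokenizer_files(tmp)

print(missing)  # ['tokenizer_config.json', 'special_tokens_map.json']
```

For a real checkout, the call would point at the directory that holds the tokenizer, for example missing_tokenizer_files("hf") if the files sit under an hf/ subdirectory.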
Intended validation coverage
This checkpoint is intended to validate support for:
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
generate()
save_pretrained()
from_pretrained()
Limitations
This is a tiny debug model. It should not be used as a general-purpose language model.
Known limitations:
- frequent phrase repetition
- TinyStories template collapse
- weak long-form coherence
- small vocabulary
- weak semantic consistency
- no instruction tuning
- no chat formatting
- no multimodal capability
- no OCR capability
- not suitable for production use
The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.
Why include MoE?
A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.
This checkpoint intentionally includes:
num_experts = 4
top_k_experts = 2
to exercise a more realistic MoE routing path than a minimal top_k=1 configuration.
Why not full attention only?
A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.
This checkpoint uses:
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
to cover both attention implementations.
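The difference between the two layer types can be sketched as a per-position visibility rule. This is a simplified illustration, not the Transformers masking code. The function name is hypothetical, and the exact window boundary (whether the window counts the current token) is an assumption that should be checked against the implementation under test.

```python
def can_attend(query_pos: int, key_pos: int, layer_type: str, sliding_window: int = 128) -> bool:
    """Return True if the query position may attend to the key position."""
    if key_pos > query_pos:
        return False  # causal: no attention to future tokens
    if layer_type == "full_attention":
        return True
    # Assumed convention: the window counts the current token, so the
    # distance must be strictly less than sliding_window.
    return query_pos - key_pos < sliding_window

print(can_attend(200, 10, "full_attention"))      # True
print(can_attend(200, 10, "sliding_attention"))   # False
print(can_attend(200, 100, "sliding_attention"))  # True
```

The two rules only differ once a sequence is longer than the 128-token window, so validation prompts need to exceed that length to distinguish the sliding and full attention paths.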
Notes on OCR and multimodal use
This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.
A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.
Suggested repository name
Suggested Hugging Face repository name:
shibatch/tinygemma4moe5m
Citation
This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.