Tiny Gemma4 MoE Text 3M
This repository contains an approximately 3M-parameter tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.
The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers and in independent inference engines.
This checkpoint is useful for implementation testing because it includes sliding attention layers, full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts. This 3M version is the recommended compact validation checkpoint, in preference to the earlier, larger tiny checkpoint.
Model purpose
This model is designed for:
- testing Gemma4ForCausalLM
- validating Gemma4TextConfig
- testing Gemma4 text-only MoE model loading
- checking model save/load behavior
- checking tokenizer save/load behavior
- exercising sliding attention layers
- exercising full attention layers
- exercising grouped-query attention
- exercising Gemma4 per-layer input embedding paths
- exercising MoE expert parameters
- exercising top-k expert routing
- providing a compact Gemma4 MoE checkpoint for inference-engine validation
It is not designed for:
- high-quality story generation
- instruction following
- chat use
- OCR
- multimodal inference
- benchmark comparison against production language models
- production deployment
Model architecture
The model uses Gemma4ForCausalLM with a small Gemma4 text MoE configuration.
Representative configuration:
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 128
hidden_size_per_layer_input: 16
intermediate_size: 384
intermediate_dim: 192
moe_intermediate_size: 192
num_hidden_layers: 6
num_attention_heads: 4
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
- sliding_attention
- sliding_attention
- full_attention
- sliding_attention
- sliding_attention
- full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
attention_dropout: 0.0
rms_norm_eps: 1e-06
initializer_range: 0.02
use_cache: true
final_logit_softcapping: null
use_bidirectional_attention: null
attention_k_eq_v: false
num_kv_shared_layers: 0
use_double_wide_mlp: false
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
router_aux_loss_coef: 0.0
pad_token_id: 1000
bos_token_id: 1000
eos_token_id: 1001
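As a rough illustration of what the head settings above imply, the following sketch derives the query and key/value projection widths and the query-to-KV head grouping from the listed values. It assumes the common grouped-query attention convention in which consecutive query heads share one key/value head. The helper name `gqa_layout` and the `(in_features, out_features)` shape notation are hypothetical and not part of Transformers.

```python
def gqa_layout(hidden_size, num_attention_heads, num_key_value_heads, head_dim):
    """Derive projection widths and head grouping for grouped-query attention.

    Assumed convention: query head i reads key/value head
    i // (num_attention_heads // num_key_value_heads).
    """
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be a multiple of num_key_value_heads")
    group_size = num_attention_heads // num_key_value_heads
    return {
        # Shapes are written as (in_features, out_features).
        "q_proj_shape": (hidden_size, num_attention_heads * head_dim),
        "kv_proj_shape": (hidden_size, num_key_value_heads * head_dim),
        "group_size": group_size,
        "q_to_kv_head": [q // group_size for q in range(num_attention_heads)],
    }


# Values taken from the representative configuration.
layout = gqa_layout(hidden_size=128, num_attention_heads=4, num_key_value_heads=1, head_dim=32)
print(layout)
```

With num_key_value_heads set to 1, all four query heads map to the same key/value head, which is the most extreme grouping an inference engine has to handle.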
The attention pattern is:
ssFssF
where s means sliding_attention and F means full_attention.
This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
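The pattern string can be expanded into the layer_types list mechanically. The sketch below does that and also shows how the two layer types differ in which earlier tokens a query may see. The helper names are hypothetical, and the exact window convention (whether the window counts the current token) is an assumption that should be checked against the Transformers implementation.

```python
def expand_layer_pattern(pattern):
    """Map a compact pattern string such as "ssFssF" to layer type names."""
    names = {"s": "sliding_attention", "F": "full_attention"}
    return [names[ch] for ch in pattern]


def can_attend(layer_type, query_pos, key_pos, sliding_window=128):
    """Return True if a query position may see a key position.

    Assumed convention (not confirmed by the model card): a sliding layer
    sees the current token and the sliding_window - 1 tokens before it.
    """
    if key_pos > query_pos:  # causal masking: never look ahead
        return False
    if layer_type == "full_attention":
        return True
    return query_pos - key_pos < sliding_window


layer_types = expand_layer_pattern("ssFssF")
print(layer_types)
print(can_attend("sliding_attention", 200, 50))  # key is 150 tokens back
print(can_attend("full_attention", 200, 50))
```

Under this assumed convention the two layer types behave identically for sequences of at most 128 tokens, so a validation prompt probably needs to exceed the window before the sliding path is distinguishable from the full path.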
MoE configuration
This model enables Gemma4 MoE blocks.
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 192
intermediate_dim: 192
The num_experts=4 and top_k_experts=2 setting is intentional. A smaller configuration such as num_experts=2, top_k_experts=1 would exercise only a much simpler routing path. This checkpoint is intended to cover:
router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
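To make the routing items above concrete, here is a minimal pure-Python sketch of generic top-2 routing for a single token: score all experts, keep the two best, renormalize their weights, and sum the weighted expert outputs. It is not the Gemma4 implementation. The function names, the toy experts, and the choice to apply softmax before selection are all assumptions for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights.

    Assumption: softmax over all experts, then renormalize over the selected
    ones. Real implementations may order these steps differently.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def moe_forward(token, router_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs for one token vector."""
    output = [0.0] * len(token)
    for expert_index, weight in route_top_k(router_logits, top_k):
        expert_out = experts[expert_index](token)
        output = [acc + weight * val for acc, val in zip(output, expert_out)]
    return output


# Four toy "experts" that only scale the input; stand-ins for expert FFNs.
experts = [lambda x, s=scale: [s * v for v in x] for scale in (1.0, 2.0, 3.0, 4.0)]
routes = route_top_k([0.1, 2.0, -1.0, 2.0], top_k=2)
mixed = moe_forward([1.0, -1.0], [0.1, 2.0, -1.0, 2.0], experts)
print(routes)
print(mixed)
```

With top_k=1 the renormalized weight is always 1.0 and the combination step degenerates to a copy, which is why a top-2 setting covers more of an engine's MoE code.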
Tokenizer
The model uses a small legacy-style byte-level BPE tokenizer.
The tokenizer is trained with:
RawTokenizer(BPE())
ByteLevel(add_prefix_space=False)
ByteLevelDecoder()
BpeTrainer(
vocab_size=1000,
min_frequency=2,
special_tokens=[],
initial_alphabet=ByteLevel.alphabet(),
)
After BPE training, the following special tokens are added:
<s> id 1000
</s> id 1001
<|im_start|> id 1002
The model config keeps vocab_size=1024. The tokenizer itself has 1003 entries (ids 0 to 1002), so ids 1003 to 1023 form a small reserved range above the actual tokenizer size. The pad token is set to <s>, so pad_token_id and bos_token_id are both 1000.
This tokenizer setup was chosen because the tiny model trained substantially better with this legacy byte-level configuration than with a standard ByteLevelBPETokenizer setup that trains special tokens directly into the vocabulary.
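The recipe above can be written as a short runnable sketch with the Hugging Face `tokenizers` library. This is a reconstruction under assumptions, not the author's training script: the function name `train_legacy_bpe` and the toy corpus are hypothetical, and `show_progress=False` is added only to keep the output quiet.

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers, trainers
from tokenizers.models import BPE


def train_legacy_bpe(texts, vocab_size=1000):
    """Train a byte-level BPE tokenizer, then append special tokens."""
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=[],  # none during training; they are appended below
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    # Appended after training, so the special tokens take the next free ids.
    tokenizer.add_special_tokens(["<s>", "</s>", "<|im_start|>"])
    return tokenizer


# Toy corpus for illustration only; the real run used TinyStories-style text.
corpus = ["Once upon a time, there was a little girl named Lily."] * 4
tokenizer = train_legacy_bpe(corpus)
print(tokenizer.token_to_id("<s>"), tokenizer.get_vocab_size())
```

On a corpus large enough to reach the full 1000-entry base vocabulary, the appended tokens land on ids 1000, 1001, and 1002, matching the table above. On the toy corpus the base vocabulary is smaller, so the ids are lower but still consecutive. Wrapping the result with `PreTrainedTokenizerFast(tokenizer_object=tokenizer, ...)` is one way to make it loadable through Transformers; whether the reference run did it this way is an assumption.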
Training data
The model was trained on TinyStories-style English story text.
The small vocabulary keeps the checkpoint compact, but it also limits text generation quality. The model often learns common TinyStories templates, especially stories about Lily, her mom, parks, apples, slides, toys, and simple moral situations.
Training setup
Representative training settings:
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
max_steps: derived from one epoch
device: auto
dtype: float32 by default
grad_clip: 1.0
error_on_nonfinite_gradients: true
weight_decay: 0.0
seed: 1234
vocab_size: 1024
base_vocab_size: 1000
legacy_tokenizer: true
legacy_special_token_ids: true
hidden_size: 128
intermediate_size: 384
moe_intermediate_size: 192
num_hidden_layers: 6
num_attention_heads: 4
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 16
layer_pattern: ssFssF
sliding_window: 128
max_position_embeddings: 1024
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
router_aux_loss_coef: 0.0
The final evaluation loss in the reference run was approximately:
Final loss: 1.5030
This loss should not be interpreted as a benchmark against production models. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage. The result is useful because the model trains cleanly and generates coherent TinyStories-like text fragments while remaining compact.
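For a sense of scale, the reported loss can be converted to perplexity and compared with the loss of a uniform guess over the vocabulary. This small calculation assumes the reported value is a mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

final_loss = 1.5030  # reported final loss of the reference run
vocab_size = 1024    # vocabulary size from the model config

# Assumption: the loss is mean token-level cross-entropy measured in nats.
perplexity = math.exp(final_loss)
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over all ids

print(f"perplexity ~ {perplexity:.2f}")
print(f"uniform-guess loss ~ {uniform_loss:.2f}")
```

Under that assumption the perplexity is about 4.5, while a uniform guess over 1024 ids would give a loss of about 6.93. The comparison only shows that training moved well away from chance; it says nothing about quality relative to other models.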
Example generation
Example output from the reference checkpoint:
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and showed it to her mom.
"Mommy, look what I found!" Lily said.
"That's a big apple, Lily. It's a special apple. It's very special," her mom replied.
Prompt: There was a little
There was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and showed it to her mom.
"Mommy, look at the apple!" Lily said.
"That's a nice apple, Lily. It's very pretty," her mom replied.
Lily was happy to have a new apple and wanted to
Prompt: One day
One day, a little girl named Lily went to the park with her mom. She saw a big slide and wanted to try it. But her mom said, "No, Lily. You have to wait. It's not safe."
Lily was sad. She wanted to go on the slide. She asked her mom, "Can I go on the slide?" Her mom said, "No, Lily. You have to wait until the slide is safe."
The model can generate coherent TinyStories-like text fragments. Repetition, template convergence, weak long-form coherence, and repeated character names are expected and are not considered failures for the intended validation purpose.
Usage
If the model files are stored under an hf/ subdirectory, use the following example.
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe3m"

tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Loading requirements
This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.
The following imports should work:
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
If these imports fail, update Transformers to a version with Gemma4 support.
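A script can run the same check without crashing when support is missing. The sketch below is one possible way to do it; the helper name `has_gemma4_support` is hypothetical, and it only tests whether the two class names resolve, not whether a particular checkpoint loads.

```python
import importlib


def has_gemma4_support():
    """Return True if the installed Transformers exposes the Gemma4 classes."""
    try:
        transformers = importlib.import_module("transformers")
    except ImportError:
        return False  # Transformers is not installed at all
    try:
        return all(
            hasattr(transformers, name)
            for name in ("Gemma4ForCausalLM", "Gemma4TextConfig")
        )
    except Exception:
        return False  # any unexpected failure while resolving the attributes


print("Gemma4 support available:", has_gemma4_support())
```

A test suite could call this helper in a skip condition so that Gemma4-specific tests are skipped rather than failing on older installations.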
Tokenizer loading note
This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.
For this reason, examples use:
from transformers import PreTrainedTokenizerFast
instead of AutoTokenizer.
Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.
The expected tokenizer files include:
tokenizer.json
tokenizer_config.json
special_tokens_map.json
Intended validation coverage
This checkpoint is intended to validate support for:
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
legacy byte-level BPE tokenizer loading
generate()
save_pretrained()
from_pretrained()
Limitations
This is a tiny debug model. It should not be used as a general-purpose language model.
Known limitations:
- TinyStories template convergence
- repeated simple story patterns
- weak long-form coherence
- small vocabulary
- weak semantic consistency
- no instruction tuning
- no chat formatting
- no multimodal capability
- no OCR capability
- not suitable for production use
The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.
Why include MoE?
A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.
This checkpoint intentionally includes:
num_experts = 4
top_k_experts = 2
to exercise a more realistic MoE routing path than a minimal top_k_experts=1 configuration.
Why not full attention only?
A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.
This checkpoint uses:
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
to cover both attention implementations.
Why a legacy tokenizer?
Earlier tiny Gemma4 MoE training attempts were sensitive to tokenizer details. The legacy byte-level BPE setup used here produced a substantially better tiny-model result than the earlier tokenizer setup.
The tokenizer intentionally keeps byte-level behavior explicit:
ByteLevel(add_prefix_space=False)
initial_alphabet=ByteLevel.alphabet()
special tokens added after BPE training
This is useful for validation because the tokenizer can be saved and loaded through PreTrainedTokenizerFast without requiring SentencePiece or tiktoken inference.
Notes on OCR and multimodal use
This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.
A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.
Suggested repository name
Suggested Hugging Face repository name:
shibatch/tinygemma4moe3m
Citation
This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.