---
license: mit
language:
- en
tags:
- gemma4
- gemma4-text
- gemma4-moe
- moe
- mixture-of-experts
- causal-lm
- tinystories
- tiny-model
- validation
- debug-model
- transformers
pipeline_tag: text-generation
---
# Tiny Gemma4 MoE Text

This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.

This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.

## Model purpose

This model is designed for:

* testing `Gemma4ForCausalLM`
* validating `Gemma4TextConfig`
* testing Gemma4 text-only MoE model loading
* checking model save/load behavior
* checking tokenizer save/load behavior
* exercising sliding attention layers
* exercising full attention layers
* exercising grouped-query attention
* exercising Gemma4 per-layer input embedding paths
* exercising MoE expert parameters
* exercising top-k expert routing
* providing a compact Gemma4 MoE checkpoint for inference-engine validation

It is not designed for:

* high-quality story generation
* instruction following
* chat use
* OCR
* multimodal inference
* benchmark comparison against production language models
* production deployment
## Model architecture

The model uses `Gemma4ForCausalLM` with a small Gemma4 text MoE configuration.

Representative configuration:

```text
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 320
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
  - sliding_attention
  - sliding_attention
  - full_attention
  - sliding_attention
  - sliding_attention
  - full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
use_double_wide_mlp: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
```

The attention pattern is:

```text
ssFssF
```

where `s` means `sliding_attention` and `F` means `full_attention`.

This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
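
The shorthand maps one-to-one onto the `layer_types` list in the configuration. As a minimal sketch, the helper below (the name `expand_layer_pattern` is hypothetical and not part of Transformers) expands the pattern string into that list:

```python
# Map the shorthand letters used in this card to Gemma4 layer type names.
LAYER_CODES = {"s": "sliding_attention", "F": "full_attention"}


def expand_layer_pattern(pattern: str) -> list[str]:
    """Expand a shorthand such as "ssFssF" into a layer_types list."""
    try:
        return [LAYER_CODES[ch] for ch in pattern]
    except KeyError as exc:
        raise ValueError(f"unknown layer code: {exc.args[0]!r}") from None


layer_types = expand_layer_pattern("ssFssF")
print(layer_types)
```

The resulting list has one entry per hidden layer, so its length should match `num_hidden_layers`.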
## MoE configuration

This model enables Gemma4 MoE blocks.

```text
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 320
```

The `num_experts=4` and `top_k_experts=2` setting is intentional. A smaller configuration such as `num_experts=2` with `top_k_experts=1` would exercise only a much simpler routing path, because a single selected expert needs no weighted combination. This checkpoint is intended to cover:

```text
router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
```
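
To make "top-2 expert selection" and "weighted expert combination" concrete, here is a generic top-k routing sketch in plain Python. It is an illustration under stated assumptions, not the Gemma4 implementation: the function names are hypothetical, the experts are toy callables, and normalizing the weights with a softmax over only the selected logits is an assumed convention.

```python
import math


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Sum the outputs of the selected experts, weighted by the router."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Four toy "experts": each scales its input by a fixed factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

routing = route_top_k(logits, k=2)
output = moe_forward([1.0, 1.0], experts, logits, k=2)
print(routing, output)
```

With `k=1` the combination step degenerates to a single expert with weight 1, which is why a top-1 configuration leaves the weighting path untested.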
## Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.

## Training setup

Representative training settings:

```text
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 160
intermediate_size: 640
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
```

The final evaluation loss in the reference run was approximately:

```text
Final loss: 2.4662
```

This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.
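
For orientation only, the loss can be converted to a perplexity and compared with the loss of a uniform guess over the vocabulary. This sketch assumes the reported value is a mean per-token cross-entropy in nats, which the card does not state explicitly:

```python
import math

final_loss = 2.4662  # value reported above
vocab_size = 1024    # from the configuration

# Assumption: the loss is mean per-token cross-entropy measured in nats.
perplexity = math.exp(final_loss)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.2f}")
print(f"uniform-guess loss: {uniform_loss:.3f}")
```

Under that assumption the perplexity is about 11.8, well below the 1024 of a uniform guess, which indicates the checkpoint learned something but says nothing about quality relative to other models.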
## Example generation

Example output from the reference checkpoint:

```text
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house."
Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.
```

```text
Prompt: There was a little
There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys."
Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys."
Lily
```

```text
Prompt: One day
One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it.
"Hello, little girl!" said her mom. "I want to play with you!"
The little
```

The model can generate TinyStories-like text fragments, but repetition and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.
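
If a rough number for the repetition is useful when comparing runs or inference engines, one option is the fraction of word n-grams that repeat an earlier n-gram. The metric and the name `repeated_ngram_fraction` are illustrative choices, not something used to produce this checkpoint:

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that are repeats of an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


# A sentence repeated twice, as in the third sample above.
sample = ("She saw a big, big tree and wanted to play with it. "
          "She saw a big, big tree and wanted to play with it.")
score = repeated_ngram_fraction(sample, n=4)
print(f"{score:.3f}")
```

A value near 0 means little verbatim repetition; values approaching 1 indicate the output is looping.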
## Usage

If the model files are stored under an `hf/` subdirectory, use the following example. If they sit at the repository root instead, drop the `subfolder` argument.

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe5m"

# Load the tokenizer and model from the hf/ subfolder of the repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False), so the output is deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
## Loading requirements

This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.

The following imports should work:

```python
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
```

If these imports fail, update Transformers to a version with Gemma4 support.
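
In a test suite it can be more convenient to check for the classes programmatically than to let the import raise. A minimal sketch, where the helper name `missing_gemma4_classes` is hypothetical:

```python
import importlib

REQUIRED_CLASSES = ("Gemma4ForCausalLM", "Gemma4TextConfig")


def missing_gemma4_classes(module_name: str = "transformers") -> list[str]:
    """Return the required class names that the given module does not expose.

    If the module itself cannot be imported, every name is reported missing.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return list(REQUIRED_CLASSES)
    return [name for name in REQUIRED_CLASSES if not hasattr(module, name)]


missing = missing_gemma4_classes()
if missing:
    print("Transformers lacks Gemma4 support; missing:", ", ".join(missing))
else:
    print("Gemma4 classes are available.")
```

An empty list means both names resolve; it does not guarantee that the installed version also handles the MoE configuration used here.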
## Tokenizer note

This repository uses a custom byte-level BPE tokenizer saved as a `PreTrainedTokenizerFast`.

For this reason, examples use:

```python
from transformers import PreTrainedTokenizerFast
```

instead of `AutoTokenizer`.

Using `AutoTokenizer` may fail in some environments if the tokenizer backend cannot be inferred automatically.

The expected tokenizer files include:

```text
tokenizer.json
tokenizer_config.json
special_tokens_map.json
```
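
When validating a local copy or a `save_pretrained()` round trip, the file list above can be checked with the standard library alone. The helper name `missing_tokenizer_files` is hypothetical, and the temporary directory only stands in for a real checkpoint folder:

```python
import tempfile
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(directory) -> list[str]:
    """Return the expected tokenizer files that are absent from a directory."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES
            if not (root / name).is_file()]


# Demo on a throwaway directory that contains only tokenizer.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    missing = missing_tokenizer_files(tmp)

print(missing)
```

For a real check, pass the directory that holds the downloaded or re-saved checkpoint, for example the `hf/` subfolder.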
## Intended validation coverage

This checkpoint is intended to validate support for:

```text
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
generate()
save_pretrained()
from_pretrained()
```
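
Of the items above, the grouped-query attention setting is easy to sanity-check by hand. The sketch below uses the usual GQA convention, in which consecutive query heads share one key/value head and projection widths are heads times `head_dim`; the helper name is hypothetical, and the widths are implied by that convention rather than read from the checkpoint:

```python
def kv_head_for_query_head(q_head: int, num_attention_heads: int,
                           num_key_value_heads: int) -> int:
    """Index of the key/value head that a given query head attends with."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly among KV heads")
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


# Values from the configuration in this card.
num_attention_heads, num_key_value_heads, head_dim = 5, 1, 32

mapping = [
    kv_head_for_query_head(h, num_attention_heads, num_key_value_heads)
    for h in range(num_attention_heads)
]
q_width = num_attention_heads * head_dim    # query projection output width
kv_width = num_key_value_heads * head_dim   # key (or value) projection output width

print(mapping, q_width, kv_width)
```

With a single key/value head, all five query heads map to head 0, which is the multi-query extreme of GQA.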
## Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

* frequent phrase repetition
* TinyStories template collapse
* weak long-form coherence
* small vocabulary
* weak semantic consistency
* no instruction tuning
* no chat formatting
* no multimodal capability
* no OCR capability
* not suitable for production use

The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.
## Why include MoE?

A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.

This checkpoint intentionally includes:

```text
num_experts = 4
top_k_experts = 2
```

to exercise a more realistic MoE routing path than a minimal `top_k_experts=1` configuration.

## Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.

This checkpoint uses:

```text
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
```

to cover both attention implementations.
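
The difference between the two layer types comes down to the attention mask. The following is a generic sketch of causal masking with and without a window, not the Transformers implementation; whether the window counts the current token is an assumed convention here:

```python
def causal_mask(seq_len: int, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives a full causal mask. Otherwise each query sees at most
    the `window` most recent positions, itself included (assumed convention).
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


full = causal_mask(5)
sliding = causal_mask(5, window=2)

for row in sliding:
    print("".join("x" if allowed else "." for allowed in row))
```

Note that the two masks coincide while the sequence is no longer than the window. With `sliding_window: 128`, a validation prompt plus generated tokens therefore needs to exceed roughly 128 tokens before the sliding layers behave differently from the full ones.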
## Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.

A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

## Suggested repository name

Suggested Hugging Face repository name:

```text
shibatch/tinygemma4moe5m
```

## Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.