---
license: mit
language:
- en
tags:
- gemma4
- gemma4-text
- gemma4-moe
- moe
- mixture-of-experts
- causal-lm
- tinystories
- tiny-model
- validation
- debug-model
- transformers
pipeline_tag: text-generation
---

# Tiny Gemma4 MoE Text

This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.

This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.

## Model purpose

This model is designed for:

* testing `Gemma4ForCausalLM`
* validating `Gemma4TextConfig`
* testing Gemma4 text-only MoE model loading
* checking model save/load behavior
* checking tokenizer save/load behavior
* exercising sliding attention layers
* exercising full attention layers
* exercising grouped-query attention
* exercising Gemma4 per-layer input embedding paths
* exercising MoE expert parameters
* exercising top-k expert routing
* providing a compact Gemma4 MoE checkpoint for inference-engine validation

It is not designed for:

* high-quality story generation
* instruction following
* chat use
* OCR
* multimodal inference
* benchmark comparison against production language models
* production deployment

## Model architecture

The model uses `Gemma4ForCausalLM` with a small Gemma4 text MoE configuration.


Representative configuration:

```text
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 320
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
  - sliding_attention
  - sliding_attention
  - full_attention
  - sliding_attention
  - sliding_attention
  - full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
use_double_wide_mlp: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
```

The attention pattern is:

```text
ssFssF
```

where `s` means `sliding_attention` and `F` means `full_attention`.

This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.

## MoE configuration

This model enables Gemma4 MoE blocks.

```text
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 320
```

The `num_experts=4` and `top_k_experts=2` setting is intentional. A smaller configuration such as `num_experts=2, top_k=1` would exercise only a much simpler routing path.

This checkpoint is intended to cover:

```text
router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
```

## Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.

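
To make the "top-2 expert selection" and "weighted expert combination" items from the MoE configuration section concrete, here is a minimal pure-Python sketch of generic top-k routing for a single token. It is not the Gemma4 implementation in Transformers. The function names, the toy experts, and the choice to renormalize softmax weights over the selected experts are all assumptions made for illustration; the real router may normalize differently.

```python
import math

NUM_EXPERTS = 4  # num_experts in the configuration above
TOP_K = 2        # top_k_experts in the configuration above


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are softmax probabilities renormalized over the
    selected experts. The actual Gemma4 router may normalize differently.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, router_logits, experts, k=TOP_K):
    """Weighted sum of the selected experts' outputs for one token vector."""
    output = [0.0] * len(token)
    for index, weight in route_top_k(router_logits, k):
        expert_out = experts[index](token)
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


# Hypothetical stand-in experts: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(NUM_EXPERTS)]

token = [1.0, -2.0]
router_logits = [0.1, 2.0, -1.0, 2.0]  # experts 1 and 3 tie for the top score

routes = route_top_k(router_logits)
mixed = moe_forward(token, router_logits, experts)
print(routes)
print(mixed)
```

With the tied logits in this example, experts 1 and 3 are selected with equal weight, so the output is the average of their two outputs. An inference engine under test should reproduce the same selection and weighting as the reference implementation, whatever normalization that implementation actually uses.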
## Training setup

Representative training settings:

```text
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 160
intermediate_size: 640
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
```

The final evaluation loss in the reference run was approximately:

```text
Final loss: 2.4662
```

This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.

## Example generation

Example output from the reference checkpoint:

```text
Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house." Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.
```

```text
Prompt: There was a little

There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys." Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys." Lily
```

```text
Prompt: One day

One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it. "Hello, little girl!" said her mom. "I want to play with you!" The little
```

The model can generate TinyStories-like text fragments, but repetition and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.

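
If it helps to put a number on the repetition seen in the samples above, one rough option is the fraction of repeated word n-grams. The helper below is a hypothetical sketch and not part of this repository; the function name and the default `n=4` are arbitrary choices.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that repeat an earlier identical n-gram.

    0.0 means every n-gram is unique; values near 1.0 mean heavy looping.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


looping = "It's a big toys. " * 5
varied = "Once upon a time, there was a little girl named Lily."

print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(varied))
```

A high value for this checkpoint is expected and does not indicate a broken implementation. The score is more useful as a quick sanity signal than as a quality metric.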
## Usage

If the model files are stored under an `hf/` subdirectory, use the following example.

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe5m"

# Load the tokenizer and model from the hf/ subdirectory of the repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Loading requirements

This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.

The following imports should work:

```python
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
```

If these imports fail, update Transformers to a version with Gemma4 support.

## Tokenizer note

This repository uses a custom byte-level BPE tokenizer saved as a `PreTrainedTokenizerFast`.

For this reason, examples use:

```python
from transformers import PreTrainedTokenizerFast
```

instead of `AutoTokenizer`.

Using `AutoTokenizer` may fail in some environments if the tokenizer backend cannot be inferred automatically.


The expected tokenizer files include:

```text
tokenizer.json
tokenizer_config.json
special_tokens_map.json
```

## Intended validation coverage

This checkpoint is intended to validate support for:

```text
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
generate()
save_pretrained()
from_pretrained()
```

## Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

* frequent phrase repetition
* TinyStories template collapse
* weak long-form coherence
* small vocabulary
* weak semantic consistency
* no instruction tuning
* no chat formatting
* no multimodal capability
* no OCR capability
* no production use

The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.

## Why include MoE?

A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.

This checkpoint intentionally includes:

```text
num_experts = 4
top_k_experts = 2
```

to exercise a more realistic MoE routing path than a minimal `top_k=1` configuration.

## Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.

This checkpoint uses:

```text
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
```

to cover both attention implementations.

## Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.


A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

## Suggested repository name

Suggested Hugging Face repository name:

```text
shibatch/tinygemma4moe5m
```

## Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.