Tiny Gemma4 MoE Text 3M

This repository contains a tiny (approximately 3M-parameter), text-only Gemma4 Mixture-of-Experts causal language model intended for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers and in independent inference engines.

This checkpoint is useful for implementation testing because it includes sliding attention layers, full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts. This 3M version is intended to replace the previous, larger tiny checkpoint as the recommended compact validation checkpoint.

Model purpose

This model is designed for:

testing Gemma4ForCausalLM
validating Gemma4TextConfig
testing Gemma4 text-only MoE model loading
checking model save/load behavior
checking tokenizer save/load behavior
exercising sliding attention layers
exercising full attention layers
exercising grouped-query attention
exercising Gemma4 per-layer input embedding paths
exercising MoE expert parameters
exercising top-k expert routing
providing a compact Gemma4 MoE checkpoint for inference-engine validation

It is not designed for:

high-quality story generation
instruction following
chat use
OCR
multimodal inference
benchmark comparison against production language models
production deployment

Model architecture

The model uses Gemma4ForCausalLM with a small Gemma4 text MoE configuration.

Representative configuration:

model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024

hidden_size: 128
hidden_size_per_layer_input: 16
intermediate_size: 384
intermediate_dim: 192
moe_intermediate_size: 192

num_hidden_layers: 6
num_attention_heads: 4
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32

sliding_window: 128
max_position_embeddings: 1024

layer_types:
  - sliding_attention
  - sliding_attention
  - full_attention
  - sliding_attention
  - sliding_attention
  - full_attention

hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
attention_dropout: 0.0
rms_norm_eps: 1e-06
initializer_range: 0.02

use_cache: true
final_logit_softcapping: null
use_bidirectional_attention: null
attention_k_eq_v: false
num_kv_shared_layers: 0
use_double_wide_mlp: false

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
router_aux_loss_coef: 0.0

pad_token_id: 1000
bos_token_id: 1000
eos_token_id: 1001

The attention pattern is:

ssFssF

where s means sliding_attention and F means full_attention.

This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
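As a small illustration, the compact pattern notation can be expanded into the layer_types list shown in the configuration. The helper name expand_layer_pattern and the PATTERN_CODES table are hypothetical and exist only for this sketch; they are not part of Transformers.

```python
# Map the compact pattern letters to the config-style layer type names.
PATTERN_CODES = {"s": "sliding_attention", "F": "full_attention"}


def expand_layer_pattern(pattern: str) -> list[str]:
    """Expand a pattern such as 'ssFssF' into a layer_types list."""
    return [PATTERN_CODES[ch] for ch in pattern]


layer_types = expand_layer_pattern("ssFssF")

# One entry per layer, so the length should match num_hidden_layers (6).
assert len(layer_types) == 6
print(layer_types)
```

A loader test could compare this list against the layer_types field of the loaded config to confirm that both attention types are present.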

MoE configuration

This model enables Gemma4 MoE blocks.

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 192
intermediate_dim: 192

The num_experts=4 and top_k_experts=2 setting is intentional. A smaller configuration such as num_experts=2 with top_k_experts=1 would exercise only a much simpler routing path, because selecting a single expert involves no weighted combination of expert outputs. This checkpoint is intended to cover:

router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
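The routing steps listed above can be sketched in plain Python. This is a generic top-k MoE illustration, not the Gemma4 implementation: the function names are hypothetical, the toy experts are simple scaling functions, and normalizing the weights with a softmax over only the selected logits is an assumption that a real engine should check against the reference code.

```python
import math


def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def combine_expert_outputs(token, experts, routes):
    """Weighted sum of the selected experts' outputs for one token vector."""
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


# Four toy "experts", each scaling its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

routes = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
mixed = combine_expert_outputs([1.0, -1.0], experts, routes)
print(routes, mixed)
```

With top_k_experts=1 the combination step collapses to a single weight of 1.0, which is why the top-2 setting gives better coverage of the dispatch and combine logic.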

Tokenizer

The model uses a small legacy-style byte-level BPE tokenizer.

The tokenizer is trained with:

RawTokenizer(BPE())
ByteLevel(add_prefix_space=False)
ByteLevelDecoder()
BpeTrainer(
  vocab_size=1000,
  min_frequency=2,
  special_tokens=[],
  initial_alphabet=ByteLevel.alphabet(),
)

After BPE training, the following special tokens are added:

<s>          id 1000
</s>         id 1001
<|im_start|> id 1002

The model config keeps vocab_size=1024, leaving a small reserved range above the actual tokenizer size. The pad token is set to <s>, so pad_token_id and bos_token_id are both 1000.
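The ID layout described above can be checked with a few lines of arithmetic. This sketch assumes that BPE training fills all 1000 base IDs (0 through 999), which is consistent with the special tokens starting at ID 1000.

```python
base_vocab_size = 1000   # BPE vocabulary size from the trainer settings
config_vocab_size = 1024  # vocab_size in the model config
special_tokens = {"<s>": 1000, "</s>": 1001, "<|im_start|>": 1002}

# Special tokens are appended directly after the base vocabulary.
assert sorted(special_tokens.values()) == list(
    range(base_vocab_size, base_vocab_size + len(special_tokens))
)

tokenizer_size = base_vocab_size + len(special_tokens)
reserved_ids = config_vocab_size - tokenizer_size

print(f"tokenizer size: {tokenizer_size}, reserved ids: {reserved_ids}")
```

Under that assumption, the embedding table has a small number of rows that the tokenizer never produces. An inference engine should still load them, since the tensor shape follows the config's vocab_size rather than the tokenizer size.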

This tokenizer setup was chosen because the tiny model trained substantially better with this legacy byte-level configuration than with a standard ByteLevelBPETokenizer setup that trains special tokens directly into the vocabulary.

Training data

The model was trained on TinyStories-style English story text.

The small vocabulary keeps the checkpoint compact, but it also limits text generation quality. The model often learns common TinyStories templates, especially stories about Lily, her mom, parks, apples, slides, toys, and simple moral situations.

Training setup

Representative training settings:

num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
max_steps: derived from one epoch

device: auto
dtype: float32 by default
grad_clip: 1.0
error_on_nonfinite_gradients: true
weight_decay: 0.0
seed: 1234

vocab_size: 1024
base_vocab_size: 1000
legacy_tokenizer: true
legacy_special_token_ids: true

hidden_size: 128
intermediate_size: 384
moe_intermediate_size: 192
num_hidden_layers: 6
num_attention_heads: 4
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 16
layer_pattern: ssFssF
sliding_window: 128
max_position_embeddings: 1024

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
router_aux_loss_coef: 0.0
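As a rough sketch of how max_steps might be derived from one epoch, the batch and block sizes above give the number of tokens consumed per optimizer step. The exact rule used by the training script is not stated here, so the floor division is an assumption, and the token count in the example is purely hypothetical.

```python
batch_size = 32
block_size = 256

# Tokens consumed by one optimizer step with the settings above.
tokens_per_step = batch_size * block_size


def steps_for_one_epoch(num_training_tokens: int) -> int:
    """Number of full steps needed to see the data once (assumed rule)."""
    return num_training_tokens // tokens_per_step


# Hypothetical dataset size, used only to illustrate the calculation.
example_tokens = 100_000_000
example_steps = steps_for_one_epoch(example_tokens)
print(tokens_per_step, example_steps)
```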

The final evaluation loss in the reference run was approximately:

Final loss: 1.5030

This loss should not be interpreted as a benchmark against production models. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage. The result is useful because the model trains cleanly and generates coherent TinyStories-like text fragments while remaining compact.
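For a rough sense of scale, the loss can be converted to a perplexity and compared with a uniform-guess baseline over the 1024-entry vocabulary. This sketch assumes the reported value is a mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly above.

```python
import math

final_loss = 1.5030  # reported final loss (assumed: nats per token)
vocab_size = 1024

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(final_loss)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.3f}")
print(f"uniform baseline loss: {uniform_loss:.3f}")
```

The gap between the two numbers is a quick sanity check that the checkpoint contains trained weights rather than random initialization.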

Example generation

Example output from the reference checkpoint:

Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and showed it to her mom.

"Mommy, look what I found!" Lily said.

"That's a big apple, Lily. It's a special apple. It's very special," her mom replied.

Prompt: There was a little

There was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and showed it to her mom.

"Mommy, look at the apple!" Lily said.

"That's a nice apple, Lily. It's very pretty," her mom replied.

Lily was happy to have a new apple and wanted to

Prompt: One day

One day, a little girl named Lily went to the park with her mom. She saw a big slide and wanted to try it. But her mom said, "No, Lily. You have to wait. It's not safe."

Lily was sad. She wanted to go on the slide. She asked her mom, "Can I go on the slide?" Her mom said, "No, Lily. You have to wait until the slide is safe."

The model can generate coherent TinyStories-like text fragments. Repetition, template convergence, weak long-form coherence, and repeated character names are expected and are not considered failures for the intended validation purpose.

Usage

If the model files are stored under an hf/ subdirectory, use the following example.

import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe3m"

tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Loading requirements

This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.

The following imports should work:

from transformers import Gemma4ForCausalLM, Gemma4TextConfig

If these imports fail, update Transformers to a version with Gemma4 support.
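A test script can perform this check programmatically before attempting to load the checkpoint. The helper name has_gemma4_support is hypothetical; the sketch only attempts the two imports named above and reports whether they succeed.

```python
def has_gemma4_support() -> bool:
    """Return True if the installed Transformers exposes the Gemma4 classes."""
    try:
        from transformers import Gemma4ForCausalLM, Gemma4TextConfig  # noqa: F401
    except (ImportError, RuntimeError):
        # ImportError covers a missing package or missing class names.
        # RuntimeError is caught defensively in case a lazy import fails
        # for another reason, such as a missing backend dependency.
        return False
    return True


supported = has_gemma4_support()
print("Gemma4 support available" if supported else "Gemma4 support missing")
```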

Tokenizer loading note

This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.

For this reason, examples use:

from transformers import PreTrainedTokenizerFast

instead of AutoTokenizer.

Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.

The expected tokenizer files include:

tokenizer.json
tokenizer_config.json
special_tokens_map.json
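Before calling from_pretrained, a validation script can confirm that these files are present in a local copy of the model directory. The helper below is a hypothetical sketch using only the standard library; the temporary directory stands in for a real download location.

```python
import tempfile
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(model_dir) -> list[str]:
    """Return the expected tokenizer files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_TOKENIZER_FILES
            if not (root / name).is_file()]


# Demonstration with a directory that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    missing = missing_tokenizer_files(tmp)

print(missing)
```

For a real checkout, the call would be missing_tokenizer_files("hf") if the files are stored under the hf/ subdirectory.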

Intended validation coverage

This checkpoint is intended to validate support for:

Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
legacy byte-level BPE tokenizer loading
generate()
save_pretrained()
from_pretrained()
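For the GQA entry in the list above, the following sketch shows how query heads map onto key/value heads with this checkpoint's head counts. The contiguous grouping rule is the common convention and is an assumption here, as is the helper name; with a single key/value head every grouping rule gives the same result.

```python
num_attention_heads = 4
num_key_value_heads = 1
head_dim = 32


def kv_head_for_query_head(q_head: int) -> int:
    """Map a query head index to the key/value head it shares (assumed rule)."""
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(num_attention_heads)]

# Output widths of the query and key/value projections.
q_width = num_attention_heads * head_dim
kv_width = num_key_value_heads * head_dim

print(mapping, q_width, kv_width)
```

All four query heads read the same key/value head, so an engine that mishandles the head broadcast is likely to fail on this checkpoint even at short sequence lengths.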

Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

TinyStories template convergence
repeated simple story patterns
weak long-form coherence
small vocabulary
weak semantic consistency
no instruction tuning
no chat formatting
no multimodal capability
no OCR capability
not suitable for production use

The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.

Why include MoE?

A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.

This checkpoint intentionally includes:

num_experts = 4
top_k_experts = 2

to exercise a more realistic MoE routing path than a minimal top_k_experts=1 configuration.

Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.

This checkpoint uses:

sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention

to cover both attention implementations.
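The difference between the two layer types can be illustrated with boolean attention masks. This is a generic sketch, not the Transformers implementation: the function names are hypothetical, and the exact window boundary (whether the window counts the current token) is an assumption that should be checked against the reference code.

```python
def causal_mask(seq_len):
    """Full attention: query q may attend to every key k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Sliding attention: query q attends to the last `window` keys.

    Assumption: the window includes the current token, so q sees
    keys q - window + 1 through q.
    """
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]


full = causal_mask(5)
local = sliding_window_mask(5, window=3)

for row_full, row_local in zip(full, local):
    print(row_full, row_local)
```

Under this assumption the two masks are identical whenever the sequence is no longer than the window. With sliding_window=128, a validation prompt therefore needs more than 128 tokens before the sliding and full attention layers can produce different results.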

Why a legacy tokenizer?

Earlier tiny Gemma4 MoE training attempts were sensitive to tokenizer details. The legacy byte-level BPE setup used here produced a substantially better tiny-model result than the earlier tokenizer setup.

The tokenizer intentionally keeps byte-level behavior explicit:

ByteLevel(add_prefix_space=False)
initial_alphabet=ByteLevel.alphabet()
special tokens added after BPE training

This is useful for validation because the tokenizer can be saved and loaded through PreTrainedTokenizerFast without depending on SentencePiece or tiktoken.

Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.

A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

Suggested repository name

Suggested Hugging Face repository name:

shibatch/tinygemma4moe3m

Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.
