metadata
license: mit
language:
  - en
tags:
  - gemma4
  - gemma4-text
  - gemma4-moe
  - moe
  - mixture-of-experts
  - causal-lm
  - tinystories
  - tiny-model
  - validation
  - debug-model
  - transformers
pipeline_tag: text-generation

Tiny Gemma4 MoE Text

This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.

This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.

Model purpose

This model is designed for:

  • testing Gemma4ForCausalLM
  • validating Gemma4TextConfig
  • testing Gemma4 text-only MoE model loading
  • checking model save/load behavior
  • checking tokenizer save/load behavior
  • exercising sliding attention layers
  • exercising full attention layers
  • exercising grouped-query attention
  • exercising Gemma4 per-layer input embedding paths
  • exercising MoE expert parameters
  • exercising top-k expert routing
  • providing a compact Gemma4 MoE checkpoint for inference-engine validation

It is not designed for:

  • high-quality story generation
  • instruction following
  • chat use
  • OCR
  • multimodal inference
  • benchmark comparison against production language models
  • production deployment

Model architecture

The model uses Gemma4ForCausalLM with a small Gemma4 text MoE configuration.

Representative configuration:

model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024

hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 320
moe_intermediate_size: 320

num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32

sliding_window: 128
max_position_embeddings: 1024

layer_types:
  - sliding_attention
  - sliding_attention
  - full_attention
  - sliding_attention
  - sliding_attention
  - full_attention

hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2

use_double_wide_mlp: false

pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
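The head-related values above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the usual grouped-query layout, in which the query projection width is num_attention_heads × head_dim and all query heads are divided evenly among the key/value heads. The variable names q_width, kv_width, and queries_per_kv are not configuration fields.

```python
# Values copied from the representative configuration above.
hidden_size = 160
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32

# Width of the concatenated query heads and of the key/value heads
# (assumes the common GQA projection layout).
q_width = num_attention_heads * head_dim
kv_width = num_key_value_heads * head_dim

# Number of query heads that share each key/value head.
queries_per_kv = num_attention_heads // num_key_value_heads

print(q_width, kv_width, queries_per_kv)  # 160 32 5
```

With a single key/value head, all five query heads read the same keys and values, so an inference engine's head-broadcast logic is exercised at its most extreme ratio.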

The attention pattern is:

ssFssF

where s means sliding_attention and F means full_attention.

This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
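As a rough illustration, the pattern string can be expanded into the layer_types list, and the difference between the two layer kinds can be expressed as a visibility rule. The helper names expand_pattern and can_attend are hypothetical, and the exact window boundary (whether a key exactly sliding_window positions back is still visible) is an assumption here; check it against the Transformers implementation before relying on it.

```python
PATTERN = "ssFssF"
SLIDING_WINDOW = 128

CODES = {"s": "sliding_attention", "F": "full_attention"}


def expand_pattern(pattern):
    """Map each pattern character to its layer type name."""
    return [CODES[c] for c in pattern]


def can_attend(layer_type, query_pos, key_pos, window=SLIDING_WINDOW):
    """Assumed mask rule: causal in every layer; sliding layers
    additionally require the key to lie within the last `window` positions."""
    if key_pos > query_pos:
        return False
    if layer_type == "sliding_attention":
        return query_pos - key_pos < window
    return True


layer_types = expand_pattern(PATTERN)
print(layer_types)
print(can_attend("sliding_attention", 200, 50))  # False: 150 positions back
print(can_attend("full_attention", 200, 50))     # True: causal only
```

Note that the two layer kinds only behave differently once the sequence is longer than sliding_window, so validation prompts need more than 128 tokens to distinguish them.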

MoE configuration

This model enables Gemma4 MoE blocks.

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 320

The num_experts=4, top_k_experts=2 setting is intentional. A smaller configuration such as num_experts=2, top_k_experts=1 would exercise only a much simpler routing path, in which no token ever combines the outputs of more than one expert. This checkpoint is intended to cover:

  • router / gate parameters
  • multiple experts
  • top-2 expert selection
  • weighted expert combination
  • MoE FFN parameters
  • dense and MoE layer interaction
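The routing path listed above can be sketched in plain Python for a single token. This is a generic top-k MoE illustration, not the Gemma4 implementation: it assumes a softmax over all router logits followed by renormalization of the selected weights, and the function names (route_top_k, moe_output) and the scalar toy experts are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: softmax over all experts, then renormalize the selected
    weights so they sum to 1. The real router may normalize differently.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_output(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Four toy scalar "experts" stand in for the expert FFNs.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

selected = route_top_k(logits)
print(selected)                          # experts 1 and 3, weight 0.5 each
print(moe_output(1.0, experts, logits))  # 3.0
```

With top_k_experts=1 the weighted sum collapses to a single term, which is why a top-2 configuration gives broader coverage of dispatch and combination logic.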

Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.

Training setup

Representative training settings:

num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256

vocab_size: 1024
hidden_size: 160
intermediate_size: 640
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2

The final evaluation loss in the reference run was approximately:

Final loss: 2.4662

This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.
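As a sanity check rather than a quality claim, the loss can be converted to a perplexity and compared with the loss of a uniform guess over the vocabulary. The sketch assumes the reported value is a mean per-token cross-entropy in nats, which the reference run does not state explicitly.

```python
import math

final_loss = 2.4662  # reported final loss, assumed to be in nats per token
vocab_size = 1024

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(final_loss)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.2f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

Under that assumption the perplexity is roughly 11.8, well below the 1024 of a uniform guess, which indicates that training ran and the weights are not random.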

Example generation

Example output from the reference checkpoint:

Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house."

Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.

Prompt: There was a little

There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys."

Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys."

Lily

Prompt: One day

One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it.

"Hello, little girl!" said her mom. "I want to play with you!"

The little

The model can generate TinyStories-like text fragments, but repetition and template collapse are expected, as the samples above show. This is normal for this checkpoint and does not count as a failure for its intended purpose.

Usage

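The following example assumes the model and tokenizer files are stored under an hf/ subdirectory of the repository. If they are stored at the repository root instead, omit the subfolder argument.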

import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe5m"

tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Loading requirements

This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.

The following imports should work:

from transformers import Gemma4ForCausalLM, Gemma4TextConfig

If these imports fail, update Transformers to a version with Gemma4 support.
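For scripts that should fail gracefully rather than crash on import, the same check can be wrapped in a small helper. This is a minimal sketch; the function name gemma4_available is hypothetical, and it only confirms that the two class names resolve, not that every MoE code path works.

```python
import importlib

REQUIRED_NAMES = ("Gemma4ForCausalLM", "Gemma4TextConfig")


def gemma4_available():
    """Return True if the installed Transformers exposes the Gemma4 classes."""
    try:
        transformers = importlib.import_module("transformers")
        for name in REQUIRED_NAMES:
            getattr(transformers, name)
    except Exception:
        # Missing package, missing class, or a failing lazy import all
        # count as "not available" for the purpose of this check.
        return False
    return True


print("Gemma4 support:", gemma4_available())
```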

Tokenizer note

This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.

For this reason, examples use:

from transformers import PreTrainedTokenizerFast

instead of AutoTokenizer.

Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.

The expected tokenizer files include:

tokenizer.json
tokenizer_config.json
special_tokens_map.json
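When working from a local copy, a quick check for these files can save a confusing loading error. The helper below is a hypothetical sketch; the directory passed in is whatever local path holds the tokenizer, and the demo uses a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(directory):
    """Return the expected tokenizer files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES
            if not (root / name).is_file()]


# Demo: a directory that contains only tokenizer.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    demo_missing = missing_tokenizer_files(tmp)

print(demo_missing)  # ['tokenizer_config.json', 'special_tokens_map.json']
```

An empty list means all three files are present.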

Intended validation coverage

This checkpoint is intended to validate support for:

  • Gemma4TextConfig
  • Gemma4ForCausalLM
  • sliding_attention layers
  • full_attention layers
  • GQA with num_key_value_heads = 1
  • global key/value head configuration
  • per-layer input embeddings
  • tied word embeddings
  • Gemma4 RMSNorm behavior
  • Gemma4 MLP activation: gelu_pytorch_tanh
  • Gemma4 MoE expert parameters
  • num_experts = 4
  • top_k_experts = 2
  • expert_interval = 2
  • MoE expert dispatch
  • MoE expert output combination
  • generate()
  • save_pretrained()
  • from_pretrained()

Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

  • frequent phrase repetition
  • TinyStories template collapse
  • weak long-form coherence
  • small vocabulary
  • weak semantic consistency
  • no instruction tuning
  • no chat formatting
  • no multimodal capability
  • no OCR capability
  • not suitable for production use

The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.

Why include MoE?

A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.

This checkpoint intentionally includes:

num_experts = 4
top_k_experts = 2

to exercise a more realistic MoE routing path than a minimal top_k_experts=1 configuration.

Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.

This checkpoint uses:

sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention

to cover both attention implementations.

Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.

A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

Suggested repository name

Suggested Hugging Face repository name:

shibatch/tinygemma4moe5m

Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.