license: mit
language:
  - en
tags:
  - gemma4
  - gemma4-text
  - gemma4-moe
  - moe
  - mixture-of-experts
  - causal-lm
  - tinystories
  - tiny-model
  - validation
  - debug-model
  - transformers
pipeline_tag: text-generation

Tiny Gemma4 MoE Text

This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.

This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.

Model purpose

This model is designed for:

testing Gemma4ForCausalLM
validating Gemma4TextConfig
testing Gemma4 text-only MoE model loading
checking model save/load behavior
checking tokenizer save/load behavior
exercising sliding attention layers
exercising full attention layers
exercising grouped-query attention
exercising Gemma4 per-layer input embedding paths
exercising MoE expert parameters
exercising top-k expert routing
providing a compact Gemma4 MoE checkpoint for inference-engine validation

It is not designed for:

high-quality story generation
instruction following
chat use
OCR
multimodal inference
benchmark comparison against production language models
production deployment

Model architecture

The model uses Gemma4ForCausalLM with a small Gemma4 text MoE configuration.

Representative configuration:

model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024

hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 320
moe_intermediate_size: 320

num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32

sliding_window: 128
max_position_embeddings: 1024

layer_types:
  - sliding_attention
  - sliding_attention
  - full_attention
  - sliding_attention
  - sliding_attention
  - full_attention

hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2

use_double_wide_mlp: false

pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
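As a quick sanity check on the attention sizes listed above, the following sketch derives the projection widths from the configuration values. It assumes the usual grouped-query layout with separate query, key, value, and output projections and no biases; the actual Gemma4 module layout may differ, and the variable names are illustrative only.

```python
# Values copied from the representative configuration above.
hidden_size = 160
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32

# Assumption: standard grouped-query attention projection shapes.
q_width = num_attention_heads * head_dim    # 5 * 32
kv_width = num_key_value_heads * head_dim   # 1 * 32
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Projection weights only (attention_bias: false); norm weights are not counted.
q_params = hidden_size * q_width
k_params = hidden_size * kv_width
v_params = hidden_size * kv_width
o_params = q_width * hidden_size
attn_params_per_layer = q_params + k_params + v_params + o_params

print(q_width, kv_width, queries_per_kv_head, attn_params_per_layer)
```

Under these assumptions all five query heads share a single key/value head, which is the most compressed grouped-query setting an implementation has to handle.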

The attention pattern is:

ssFssF

where s means sliding_attention and F means full_attention.
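The compact pattern string maps directly onto the layer_types list in the configuration. A minimal sketch of that expansion, using a hypothetical helper name that is not part of Transformers:

```python
def expand_layer_pattern(pattern: str) -> list[str]:
    """Map a compact pattern such as "ssFssF" to layer_types entries."""
    names = {"s": "sliding_attention", "F": "full_attention"}
    unknown = sorted(set(pattern) - set(names))
    if unknown:
        raise ValueError(f"unknown pattern symbols: {unknown}")
    return [names[symbol] for symbol in pattern]


layer_types = expand_layer_pattern("ssFssF")
print(layer_types)
```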

This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.

MoE configuration

This model enables Gemma4 MoE blocks.

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 320

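The num_experts=4 and top_k_experts=2 setting is intentional. A smaller configuration such as num_experts=2 with top_k_experts=1 would exercise only a much simpler routing path, because each token would be routed to a single expert and the outputs of several experts would never be combined. This checkpoint is intended to cover: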

router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
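The routing steps in the list above can be illustrated with a small pure-Python sketch. It is a generic top-k router, not the Gemma4 implementation: it assumes a softmax over all experts followed by renormalization of the selected weights, and the function names and toy experts are hypothetical.

```python
import math


def route_top_k(router_logits, top_k):
    """Return (expert_index, weight) pairs for one token.

    Assumption: softmax over all experts, keep the top_k largest,
    then renormalize the kept weights so they sum to 1.
    """
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def combine_experts(token, experts, routing):
    """Weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(token)
    for index, weight in routing:
        expert_out = experts[index](token)
        out = [acc + weight * value for acc, value in zip(out, expert_out)]
    return out


# Four toy "experts" that only scale the input (stand-ins for expert FFNs).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
routing = route_top_k([0.1, 2.0, -1.0, 1.0], top_k=2)
output = combine_experts([1.0, 1.0], experts, routing)
print(routing, output)
```

With top_k=1 the loop in combine_experts would run once per token, which is why a top-2 configuration gives better coverage of the combination step.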

Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.

Training setup

Representative training settings:

num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256

vocab_size: 1024
hidden_size: 160
intermediate_size: 640
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2

The final evaluation loss in the reference run was approximately:

Final loss: 2.4662

This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.
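For scale, the loss can be converted to perplexity and compared with the loss of a uniform guess over the vocabulary. The sketch below assumes the reported value is a mean cross-entropy in nats per token, which the reference run does not state explicitly.

```python
import math

final_loss = 2.4662  # reference-run value reported above
vocab_size = 1024

# Assumption: the loss is mean cross-entropy in nats per token.
perplexity = math.exp(final_loss)

# Loss of a uniform guess over the whole vocabulary, for comparison.
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.2f}")
print(f"uniform-guess loss: {uniform_loss:.3f}")
```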

Example generation

Example output from the reference checkpoint:

Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house."

Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.

Prompt: There was a little

There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys."

Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys."

Lily

Prompt: One day

One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it.

"Hello, little girl!" said her mom. "I want to play with you!"

The little

The model can generate TinyStories-like text fragments, but repetition and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.
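If a rough number for the repetition is useful when comparing inference engines, a simple repeated n-gram fraction can be computed over the generated text. This is an illustrative metric with a hypothetical helper name, not something the reference run reports.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


# A fragment taken from the second example output above.
sample = "It's a big toys. It's a big toys. It's a big toys."
print(repeated_ngram_fraction(sample))
```

Two engines running greedy decoding on the same prompt would be expected to produce the same text, so comparing the decoded strings directly is a stricter check than this metric.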

Usage

The example below assumes the model files are stored under an hf/ subdirectory of the repository. If they are stored at the repository root, drop the subfolder argument.

import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe5m"

# Load the tokenizer and the model from the hf/ subfolder of the repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()  # switch to evaluation mode

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode the full sequence (prompt plus continuation) back to text.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Loading requirements

This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.

The following imports should work:

from transformers import Gemma4ForCausalLM, Gemma4TextConfig

If these imports fail, update Transformers to a version with Gemma4 support.
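A small guard can make this check explicit at the top of a test script. The helper below is a hypothetical convenience function, not part of Transformers; it only reports whether the two class names mentioned above can be resolved in the installed package.

```python
import importlib

REQUIRED_CLASSES = ("Gemma4ForCausalLM", "Gemma4TextConfig")


def has_gemma4_support() -> bool:
    """Return True if the installed Transformers exposes the Gemma4 classes."""
    try:
        transformers = importlib.import_module("transformers")
        for name in REQUIRED_CLASSES:
            getattr(transformers, name)
    except Exception:
        # Broad on purpose: a missing package, a missing class, or a failed
        # lazy import all mean the checkpoint cannot be loaded here.
        return False
    return True


if not has_gemma4_support():
    print("Gemma4 classes not found; try: pip install -U transformers")
```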

Tokenizer note

This repository uses a custom byte-level BPE tokenizer saved as a PreTrainedTokenizerFast.

For this reason, examples use:

from transformers import PreTrainedTokenizerFast

instead of AutoTokenizer.

Using AutoTokenizer may fail in some environments if the tokenizer backend cannot be inferred automatically.

The expected tokenizer files include:

tokenizer.json
tokenizer_config.json
special_tokens_map.json
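When working from a local copy of the repository, the presence of these files can be checked before loading. The sketch below uses a hypothetical helper name, and the "hf" path is an assumed local directory.

```python
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(directory: str) -> list[str]:
    """Return the expected tokenizer files that are absent from directory."""
    root = Path(directory)
    return [
        name for name in EXPECTED_TOKENIZER_FILES
        if not (root / name).is_file()
    ]


# "hf" is an assumed local directory; an empty list means nothing is missing.
print(missing_tokenizer_files("hf"))
```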

Intended validation coverage

This checkpoint is intended to validate support for:

Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
generate()
save_pretrained()
from_pretrained()
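Part of this coverage can be checked without loading any weights, by comparing a parsed config.json against the values listed in this README. The sketch below assumes the keys appear at the top level of config.json under the same names as in the representative configuration; the helper names and the file path in the usage note are hypothetical.

```python
import json
from pathlib import Path

# Expected values copied from the representative configuration in this README.
EXPECTED = {
    "num_hidden_layers": 6,
    "num_attention_heads": 5,
    "num_key_value_heads": 1,
    "head_dim": 32,
    "sliding_window": 128,
    "hidden_activation": "gelu_pytorch_tanh",
    "tie_word_embeddings": True,
    "enable_moe_block": True,
    "num_experts": 4,
    "top_k_experts": 2,
    "expert_interval": 2,
}


def config_mismatches(config: dict) -> dict:
    """Return {key: (expected, actual)} for every value that differs."""
    return {
        key: (expected, config.get(key))
        for key, expected in EXPECTED.items()
        if config.get(key) != expected
    }


def load_config(path: str) -> dict:
    """Read a config.json file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For a local copy, `config_mismatches(load_config("hf/config.json"))` would return an empty dict when every listed value matches.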

Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

frequent phrase repetition
TinyStories template collapse
weak long-form coherence
small vocabulary
weak semantic consistency
no instruction tuning
no chat formatting
no multimodal capability
no OCR capability
not suitable for production use

The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.

Why include MoE?

A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.

This checkpoint intentionally includes:

num_experts = 4
top_k_experts = 2

to exercise a more realistic MoE routing path than a minimal top_k_experts=1 configuration.

Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.

This checkpoint uses:

sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention

to cover both attention implementations.
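The difference between the two layer types can be sketched as the set of key positions a query is allowed to see. The exact window boundary is an assumption here (the current token plus the previous sliding_window - 1 tokens); an implementation may define it slightly differently, and the helper name is hypothetical.

```python
def visible_key_positions(
    query_pos: int, layer_type: str, sliding_window: int = 128
) -> range:
    """Key positions a causal query at query_pos may attend to.

    Assumption: a sliding layer sees the current token plus the
    sliding_window - 1 tokens before it; a full layer sees every
    position up to and including the current token.
    """
    if layer_type == "full_attention":
        return range(0, query_pos + 1)
    if layer_type == "sliding_attention":
        return range(max(0, query_pos - sliding_window + 1), query_pos + 1)
    raise ValueError(f"unknown layer type: {layer_type}")


full = visible_key_positions(300, "full_attention")
sliding = visible_key_positions(300, "sliding_attention")
print(len(full), len(sliding))
```

Under this assumption the two layer types produce different masks only once the sequence is longer than sliding_window, so validation prompts shorter than 128 tokens would not distinguish them.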

Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.

A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

Suggested repository name

Suggested Hugging Face repository name:

shibatch/tinygemma4moe5m

Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.