---
license: mit
language:
- en
tags:
- gemma4
- gemma4-text
- gemma4-moe
- moe
- mixture-of-experts
- causal-lm
- tinystories
- tiny-model
- validation
- debug-model
- transformers
pipeline_tag: text-generation
---

# Tiny Gemma4 MoE Text

This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.

The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.

This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.

## Model purpose

This model is designed for:

* testing `Gemma4ForCausalLM`
* validating `Gemma4TextConfig`
* testing Gemma4 text-only MoE model loading
* checking model save/load behavior
* checking tokenizer save/load behavior
* exercising sliding attention layers
* exercising full attention layers
* exercising grouped-query attention
* exercising Gemma4 per-layer input embedding paths
* exercising MoE expert parameters
* exercising top-k expert routing
* providing a compact Gemma4 MoE checkpoint for inference-engine validation

It is not designed for:

* high-quality story generation
* instruction following
* chat use
* OCR
* multimodal inference
* benchmark comparison against production language models
* production deployment

## Model architecture

The model uses `Gemma4ForCausalLM` with a small Gemma4 text MoE configuration.

Representative configuration:

```text
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024

hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 320
moe_intermediate_size: 320

num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32

sliding_window: 128
max_position_embeddings: 1024

layer_types:
  - sliding_attention
  - sliding_attention
  - full_attention
  - sliding_attention
  - sliding_attention
  - full_attention

hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2

use_double_wide_mlp: false

pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
```
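The attention-related values above can be cross-checked with a little arithmetic. The sketch below assumes the usual layout in which the query projection is `num_attention_heads * head_dim` wide and the key/value projections are `num_key_value_heads * head_dim` wide; the variable names `query_width`, `kv_width`, and `queries_per_kv_head` are illustrative and are not configuration fields.

```python
# Values copied from the representative configuration.
hidden_size = 160
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32

# Assumption: projections follow the common heads * head_dim layout.
query_width = num_attention_heads * head_dim
kv_width = num_key_value_heads * head_dim

# Grouped-query attention: how many query heads share each key/value head.
assert num_attention_heads % num_key_value_heads == 0
queries_per_kv_head = num_attention_heads // num_key_value_heads

print(f"query width: {query_width}")
print(f"key/value width: {kv_width}")
print(f"query heads per KV head: {queries_per_kv_head}")
```

With a single key/value head, all five query heads share the same keys and values, which is the most extreme grouped-query setting and a useful edge case for inference engines.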

The attention pattern is:

```text
ssFssF
```

where `s` means `sliding_attention` and `F` means `full_attention`.

This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
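As a small illustration, the compact pattern string can be expanded into the `layer_types` list shown in the configuration. The helper name `expand_layer_pattern` and the `LAYER_CODES` mapping are hypothetical and are not part of Transformers.

```python
# Hypothetical helper: maps the one-letter codes used in this card
# to the layer type names that appear in the configuration.
LAYER_CODES = {
    "s": "sliding_attention",
    "F": "full_attention",
}


def expand_layer_pattern(pattern):
    """Return one layer type name per character in `pattern`."""
    return [LAYER_CODES[code] for code in pattern]


layer_types = expand_layer_pattern("ssFssF")
for index, layer_type in enumerate(layer_types):
    print(index, layer_type)
```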

## MoE configuration

This model enables Gemma4 MoE blocks.

```text
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 320
```

The `num_experts=4` and `top_k_experts=2` setting is intentional. A smaller configuration such as `num_experts=2, top_k_experts=1` would exercise only a much simpler routing path. This checkpoint is intended to cover:

```text
router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
```
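To make the routing steps in this list concrete, here is a minimal pure-Python sketch of generic top-k routing for a single token. It is not the Gemma4 implementation in Transformers: the softmax over all experts followed by renormalisation of the kept weights is an assumption, and `route_top_k`, `combine_experts`, and the toy experts are hypothetical names used only for illustration.

```python
import math


def route_top_k(router_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    # Softmax over all experts, stabilised by subtracting the largest logit.
    peak = max(router_logits)
    exps = [math.exp(logit - peak) for logit in router_logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Keep the top_k most probable experts.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    # Assumption: renormalise so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def combine_experts(token, experts, routing):
    """Weighted sum of the outputs of the selected experts."""
    output = [0.0] * len(token)
    for index, weight in routing:
        expert_output = experts[index](token)
        output = [acc + weight * value for acc, value in zip(output, expert_output)]
    return output


# Four toy "experts" that simply scale the input by a different factor.
experts = [lambda x, s=scale: [s * v for v in x] for scale in (1.0, 2.0, 3.0, 4.0)]

routing = route_top_k([0.1, 2.0, -1.0, 1.0], top_k=2)
mixed = combine_experts([1.0, -1.0], experts, routing)
print(routing)
print(mixed)
```

With `top_k=1` the weighted combination collapses to a single expert with weight 1, which is why a top-2 setting covers more of the dispatch and combination logic.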

## Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.

## Training setup

Representative training settings:

```text
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256

vocab_size: 1024
hidden_size: 160
intermediate_size: 640
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128

enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
```

The final evaluation loss in the reference run was approximately:

```text
Final loss: 2.4662
```

This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.
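For orientation only, the loss can be converted to a perplexity and set next to the loss of a uniform guess over the vocabulary. This sketch assumes the reported value is a mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

final_loss = 2.4662  # reported for the reference run
vocab_size = 1024    # from the configuration

# Assumption: the loss is mean cross-entropy per token, measured in nats.
perplexity = math.exp(final_loss)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.2f}")
print(f"uniform-guess loss: {uniform_loss:.4f}")
```

Under that assumption the perplexity is roughly 11.8, well below the 1024 of a uniform guess, which is enough to confirm that training ran through all code paths and the weights are not random.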

## Example generation

Example output from the reference checkpoint:

```text
Prompt: Once upon

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house."

Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.
```

```text
Prompt: There was a little

There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys."

Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys."

Lily
```

```text
Prompt: One day

One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it.

"Hello, little girl!" said her mom. "I want to play with you!"

The little
```

The model can generate TinyStories-like text fragments, but repetition and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.

## Usage

The example below assumes the model files are stored under an `hf/` subdirectory of the repository. If they are stored at the repository root instead, omit the `subfolder` argument.

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe5m"

tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Loading requirements

This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.

The following imports should work:

```python
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
```

If these imports fail, update Transformers to a version with Gemma4 support.
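A guarded version of this check can be useful in test suites that should skip rather than crash on older installations. The function name `gemma4_available` is hypothetical; the sketch only relies on the two class names listed above.

```python
def gemma4_available():
    """Return True if the installed Transformers exposes the Gemma4 text classes."""
    try:
        from transformers import Gemma4ForCausalLM, Gemma4TextConfig  # noqa: F401
    except (ImportError, RuntimeError):
        # ImportError: Transformers is missing or too old to know these names.
        # RuntimeError: some versions may wrap lazy-import failures this way.
        return False
    return True


supported = gemma4_available()
print("Gemma4 support available:", supported)
```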

## Tokenizer note

This repository uses a custom byte-level BPE tokenizer saved as a `PreTrainedTokenizerFast`.

For this reason, examples use:

```python
from transformers import PreTrainedTokenizerFast
```

instead of `AutoTokenizer`.

Using `AutoTokenizer` may fail in some environments if the tokenizer backend cannot be inferred automatically.

The expected tokenizer files include:

```text
tokenizer.json
tokenizer_config.json
special_tokens_map.json
```
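Before loading from a local copy, it can help to confirm that these files are present. The following is a minimal sketch using only the standard library; `missing_tokenizer_files` is a hypothetical helper, and the temporary directory merely stands in for a real download location.

```python
import tempfile
from pathlib import Path

EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(directory):
    """Return the expected tokenizer file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory that contains only tokenizer.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    missing = missing_tokenizer_files(tmp)

print("missing:", missing)
```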

## Intended validation coverage

This checkpoint is intended to validate support for:

```text
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
generate()
save_pretrained()
from_pretrained()
```

## Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

* frequent phrase repetition
* TinyStories template collapse
* weak long-form coherence
* small vocabulary
* weak semantic consistency
* no instruction tuning
* no chat formatting
* no multimodal capability
* no OCR capability
* not suitable for production deployment

The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.

## Why include MoE?

A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.

This checkpoint intentionally includes:

```text
num_experts = 4
top_k_experts = 2
```

to exercise a more realistic MoE routing path than a minimal `top_k_experts=1` configuration.

## Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.

This checkpoint uses:

```text
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
```

to cover both attention implementations.
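The difference between the two layer types comes down to which key positions each query may attend to. The sketch below builds boolean masks for both cases; it assumes a window of size `W` lets a query see itself and the `W - 1` positions before it, and the function name `causal_mask` is illustrative rather than a Transformers API.

```python
def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives a full causal mask. An integer window additionally
    limits each query to the most recent `window` positions (assumed convention).
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# A tiny window makes the effect visible; the checkpoint uses sliding_window=128.
full = causal_mask(5)
sliding = causal_mask(5, window=2)

for row_full, row_sliding in zip(full, sliding):
    print(
        "".join("x" if allowed else "." for allowed in row_full),
        "".join("x" if allowed else "." for allowed in row_sliding),
    )
```

Under this convention the two masks are identical for sequences no longer than the window, so a validation run needs prompts longer than 128 tokens before the sliding layers behave differently from the full layers.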

## Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.

A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

## Suggested repository name

Suggested Hugging Face repository name:

```text
shibatch/tinygemma4moe5m
```

## Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.