---
license: mit
language:
- en
tags:
- gemma4
- gemma4-text
- gemma4-moe
- moe
- mixture-of-experts
- causal-lm
- tinystories
- tiny-model
- validation
- debug-model
- transformers
pipeline_tag: text-generation
---
# Tiny Gemma4 MoE Text
This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.
The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.
This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.
## Model purpose
This model is designed for:
* testing `Gemma4ForCausalLM`
* validating `Gemma4TextConfig`
* testing Gemma4 text-only MoE model loading
* checking model save/load behavior
* checking tokenizer save/load behavior
* exercising sliding attention layers
* exercising full attention layers
* exercising grouped-query attention
* exercising Gemma4 per-layer input embedding paths
* exercising MoE expert parameters
* exercising top-k expert routing
* providing a compact Gemma4 MoE checkpoint for inference-engine validation
It is not designed for:
* high-quality story generation
* instruction following
* chat use
* OCR
* multimodal inference
* benchmark comparison against production language models
* production deployment
## Model architecture
The model uses `Gemma4ForCausalLM` with a small Gemma4 text MoE configuration.
Representative configuration:
```text
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 320
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
- sliding_attention
- sliding_attention
- full_attention
- sliding_attention
- sliding_attention
- full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
use_double_wide_mlp: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
```
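The head-related values above can be checked with a little arithmetic. The sketch below only uses numbers from the configuration; the query-to-KV-head mapping it prints assumes the usual grouped-query convention (consecutive query heads share one key/value head), which is an assumption here rather than something read from the Gemma4 code.

```python
# Values copied from the representative configuration.
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32
hidden_size = 160

# Output widths of the query and key/value projections.
q_proj_out = num_attention_heads * head_dim    # 5 * 32
kv_proj_out = num_key_value_heads * head_dim   # 1 * 32

# Number of query heads that share each key/value head.
group_size = num_attention_heads // num_key_value_heads

# Assumed GQA convention: query head q reads from KV head q // group_size.
kv_head_for_query_head = [q // group_size for q in range(num_attention_heads)]

print(q_proj_out, kv_proj_out, group_size, kv_head_for_query_head)
```

With a single key/value head, every query head maps to KV head 0, so this checkpoint covers the most extreme grouping case.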
The attention pattern is:
```text
ssFssF
```
where `s` means `sliding_attention` and `F` means `full_attention`.
This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
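As a small illustration, the shorthand can be expanded back into the `layer_types` list. The helper name `expand_layer_pattern` is hypothetical and is not part of Transformers.

```python
# Mapping from the shorthand letters to the layer type names used in the config.
PATTERN_TO_LAYER_TYPE = {
    "s": "sliding_attention",
    "F": "full_attention",
}


def expand_layer_pattern(pattern: str) -> list[str]:
    """Expand a shorthand such as 'ssFssF' into a list of layer type names."""
    return [PATTERN_TO_LAYER_TYPE[ch] for ch in pattern]


layer_types = expand_layer_pattern("ssFssF")
for index, layer_type in enumerate(layer_types):
    print(index, layer_type)
```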
## MoE configuration
This model enables Gemma4 MoE blocks.
```text
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
moe_intermediate_size: 320
```
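The `num_experts=4` and `top_k_experts=2` setting is intentional. A smaller configuration such as `num_experts=2, top_k_experts=1` would exercise only a much simpler routing path, because each token would be sent to a single expert and no weighted combination of expert outputs would be needed. This checkpoint is intended to cover: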
```text
router / gate parameters
multiple experts
top-2 expert selection
weighted expert combination
MoE FFN parameters
dense and MoE layer interaction
```
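To make the items in the list above concrete, here is a minimal pure-Python sketch of generic top-2 routing for a single token. It is not the Gemma4 implementation: the softmax over router logits, the renormalization of the selected weights, and the toy experts are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Assumption: selected weights are renormalized so they sum to 1.
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts for one token."""
    out = [0.0] * len(x)
    for index, weight in route_top_k(router_logits, k):
        expert_out = experts[index](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Four toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]

router_logits = [0.0, 2.0, -1.0, 1.0]
selected = route_top_k(router_logits, k=2)
output = moe_forward([1.0, 2.0], router_logits, experts, k=2)
print(selected)
print(output)
```

With `k=1` the loop body would run once with weight 1, which is why a top-1 configuration does not exercise the weighted combination step.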
## Training data
The model was trained on TinyStories-style English story text.
The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.
## Training setup
Representative training settings:
```text
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 160
intermediate_size: 640
moe_intermediate_size: 320
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128
enable_moe_block: true
num_experts: 4
top_k_experts: 2
expert_interval: 2
```
The final evaluation loss in the reference run was approximately:
```text
Final loss: 2.4662
```
This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.
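For orientation only, the loss can be converted to a perplexity. This assumes the reported value is a mean per-token cross-entropy in nats, which the training log excerpt does not state explicitly; under that assumption the perplexity is roughly 11.8 over the 1024-token vocabulary.

```python
import math

# Final loss from the reference run (assumed: mean cross-entropy in nats per token).
final_loss = 2.4662

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(final_loss)
print(f"perplexity = {perplexity:.2f}")
```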
## Example generation
Example output from the reference checkpoint:
```text
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house."
Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.
```
```text
Prompt: There was a little
There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys."
Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys."
Lily
```
```text
Prompt: One day
One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it.
"Hello, little girl!" said her mom. "I want to play with you!"
The little
```
The model can generate TinyStories-like text fragments, but repetition and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.
## Usage
The example below assumes the model files are stored under an `hf/` subdirectory of the repository. If they are stored at the repository root instead, drop the `subfolder` argument.
```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4moe5m"

# Tokenizer and weights are both loaded from the hf/ subfolder.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
## Loading requirements
This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.
The following imports should work:
```python
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
```
If these imports fail, update Transformers to a version with Gemma4 support.
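A test suite can probe for this up front instead of failing at import time. The helper below is a hypothetical sketch (the name `has_gemma4_support` is not part of Transformers); it simply attempts the two imports listed above and reports whether they succeeded.

```python
def has_gemma4_support() -> bool:
    """Return True if the installed Transformers exposes the Gemma4 text classes."""
    try:
        from transformers import Gemma4ForCausalLM, Gemma4TextConfig  # noqa: F401
    except (ImportError, RuntimeError):
        # ImportError: Transformers is missing or too old to provide these names.
        # RuntimeError: some Transformers versions wrap backend import failures.
        return False
    return True


if has_gemma4_support():
    print("Gemma4 classes are available.")
else:
    print("Gemma4 classes are not available; update Transformers.")
```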
## Tokenizer note
This repository uses a custom byte-level BPE tokenizer saved as a `PreTrainedTokenizerFast`.
For this reason, examples use:
```python
from transformers import PreTrainedTokenizerFast
```
instead of `AutoTokenizer`.
Using `AutoTokenizer` may fail in some environments if the tokenizer backend cannot be inferred automatically.
The expected tokenizer files include:
```text
tokenizer.json
tokenizer_config.json
special_tokens_map.json
```
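When debugging a failed tokenizer load from a local copy, it can help to confirm that these files are present first. The following is a minimal sketch using only the standard library; `missing_tokenizer_files` is a hypothetical helper, and the directory you pass in is whatever local path holds your downloaded files.

```python
from pathlib import Path

# File names taken from the list of expected tokenizer files.
EXPECTED_TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(directory) -> list[str]:
    """Return the expected tokenizer files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_TOKENIZER_FILES if not (root / name).is_file()]


# Example: check the current working directory.
print(missing_tokenizer_files("."))
```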
## Intended validation coverage
This checkpoint is intended to validate support for:
```text
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
Gemma4 MoE expert parameters
num_experts = 4
top_k_experts = 2
expert_interval = 2
MoE expert dispatch
MoE expert output combination
generate()
save_pretrained()
from_pretrained()
```
## Limitations
This is a tiny debug model. It should not be used as a general-purpose language model.
Known limitations:
* frequent phrase repetition
* TinyStories template collapse
* weak long-form coherence
* small vocabulary
* weak semantic consistency
* no instruction tuning
* no chat formatting
* no multimodal capability
* no OCR capability
* not suitable for production use
The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.
## Why include MoE?
A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.
This checkpoint intentionally includes:
```text
num_experts = 4
top_k_experts = 2
```
to exercise a more realistic MoE routing path than a minimal `top_k_experts=1` configuration.
## Why not full attention only?
A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.
This checkpoint uses:
```text
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
```
to cover both attention implementations.
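The difference between the two layer types only shows up once a sequence is longer than the sliding window. The sketch below builds both masks in plain Python for the documented `sliding_window` of 128 and a 256-token sequence (the training `block_size`). The exact window boundary is an assumption: here a query at position `i` sees keys `j` with `i - window < j <= i`, and an inference engine under test may define the edge differently.

```python
def attention_mask(seq_len, window=None):
    """Boolean mask[i][j]: True if query position i may attend to key position j.

    window=None gives a full causal mask. Otherwise a sliding causal mask is
    built, assuming the window counts the current token (i - window < j <= i).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i
            if window is not None:
                visible = visible and j > i - window
            row.append(visible)
        mask.append(row)
    return mask


seq_len = 256
sliding_window = 128

full = attention_mask(seq_len)
sliding = attention_mask(seq_len, window=sliding_window)

# Rows where the two masks disagree are the positions that exercise the sliding path.
differing_rows = [i for i in range(seq_len) if full[i] != sliding[i]]
print(len(differing_rows), differing_rows[0])
```

Prompts shorter than the window produce identical masks for both layer types, so a validation run needs sequences longer than 128 tokens to actually distinguish them.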
## Notes on OCR and multimodal use
This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.
A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.
## Suggested repository name
Suggested Hugging Face repository name:
```text
shibatch/tinygemma4moe5m
```
## Citation
This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.