	---
	license: mit
	language:
	- en
	tags:
	- gemma4
	- gemma4-text
	- gemma4-moe
	- moe
	- mixture-of-experts
	- causal-lm
	- tinystories
	- tiny-model
	- validation
	- debug-model
	- transformers
	pipeline_tag: text-generation
	---

	# Tiny Gemma4 MoE Text

	This repository contains a tiny Gemma4 text-only Mixture-of-Experts causal language model for validation and debugging.

	The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises Gemma4 MoE text-model code paths in Hugging Face Transformers.

	This checkpoint is useful for implementation testing because it includes both sliding attention and full attention layers, grouped-query attention, per-layer input embeddings, and MoE routing with multiple experts.

	## Model purpose

	This model is designed for:

	* testing `Gemma4ForCausalLM`
	* validating `Gemma4TextConfig`
	* testing Gemma4 text-only MoE model loading
	* checking model save/load behavior
	* checking tokenizer save/load behavior
	* exercising sliding attention layers
	* exercising full attention layers
	* exercising grouped-query attention
	* exercising Gemma4 per-layer input embedding paths
	* exercising MoE expert parameters
	* exercising top-k expert routing
	* providing a compact Gemma4 MoE checkpoint for inference-engine validation

	It is not designed for:

	* high-quality story generation
	* instruction following
	* chat use
	* OCR
	* multimodal inference
	* benchmark comparison against production language models
	* production deployment

	## Model architecture

	The model uses `Gemma4ForCausalLM` with a small Gemma4 text MoE configuration.

	Representative configuration:

	```text
	model_type: gemma4_text
	vocab_size: 1024
	vocab_size_per_layer_input: 1024

	hidden_size: 160
	hidden_size_per_layer_input: 24
	intermediate_size: 320
	moe_intermediate_size: 320

	num_hidden_layers: 6
	num_attention_heads: 5
	num_key_value_heads: 1
	num_global_key_value_heads: 1
	head_dim: 32
	global_head_dim: 32

	sliding_window: 128
	max_position_embeddings: 1024

	layer_types:
	- sliding_attention
	- sliding_attention
	- full_attention
	- sliding_attention
	- sliding_attention
	- full_attention

	hidden_activation: gelu_pytorch_tanh
	tie_word_embeddings: true
	attention_bias: false
	rms_norm_eps: 1e-06

	enable_moe_block: true
	num_experts: 4
	top_k_experts: 2
	expert_interval: 2

	use_double_wide_mlp: false

	pad_token_id: 2
	bos_token_id: 0
	eos_token_id: 1
	```

	The attention pattern is:

	```text
	ssFssF
	```

	where `s` means `sliding_attention` and `F` means `full_attention`.

	This pattern was chosen for validation coverage. A full-attention-only model may be easier to train, but it would not exercise the sliding attention path.
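
As a small illustration, the compact pattern string can be expanded into the `layer_types` list shown in the configuration. The helper name `expand_layer_pattern` is hypothetical and not part of Transformers; it only sketches the mapping described above.

```python
# Hypothetical helper: expands a compact pattern such as "ssFssF"
# into the per-layer attention types listed in the configuration.
PATTERN_CODES = {"s": "sliding_attention", "F": "full_attention"}


def expand_layer_pattern(pattern: str) -> list[str]:
    """Map each pattern character to its attention layer type."""
    try:
        return [PATTERN_CODES[code] for code in pattern]
    except KeyError as exc:
        raise ValueError(f"unknown layer code: {exc.args[0]!r}") from None


layer_types = expand_layer_pattern("ssFssF")
print(layer_types)
```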

	## MoE configuration

	This model enables Gemma4 MoE blocks.

	```text
	enable_moe_block: true
	num_experts: 4
	top_k_experts: 2
	expert_interval: 2
	moe_intermediate_size: 320
	```

	The `num_experts=4`, `top_k_experts=2` setting is intentional. A smaller configuration such as `num_experts=2`, `top_k_experts=1` would exercise only a much simpler routing path, with no weighted combination of multiple expert outputs. This checkpoint is intended to cover:

	```text
	router / gate parameters
	multiple experts
	top-2 expert selection
	weighted expert combination
	MoE FFN parameters
	dense and MoE layer interaction
	```
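
The routing path listed above can be sketched in plain Python. This is a generic top-k softmax router under stated assumptions, not the Gemma4 implementation in Transformers: the function name `route_top_k`, the renormalization of the selected weights, and the toy experts are all illustrative.

```python
import math


def route_top_k(router_logits, experts, x, top_k=2):
    """Generic top-k MoE routing for one token (illustrative only)."""
    # Softmax over all expert logits.
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize the selected weights so they sum to 1 (an assumption;
    # real implementations differ on this detail).
    norm = sum(probs[i] for i in chosen)
    weights = {i: probs[i] / norm for i in chosen}

    # Weighted combination of the selected experts' outputs.
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, weights


# Four toy "experts": each scales a scalar input by a different factor.
toy_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
output, weights = route_top_k([0.1, 2.0, -1.0, 2.0], toy_experts, x=1.0)
print(sorted(weights), output)
```

With `top_k=1` the combination step reduces to a single expert call, which is why a top-1 configuration leaves the weighted-combination path untested.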

	## Training data

	The model was trained on TinyStories-style English story text.

	The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. The small vocabulary keeps the checkpoint compact, but it also limits text generation quality.

	## Training setup

	Representative training settings:

	```text
	num_epochs: 1
	learning_rate: 2e-4
	batch_size: 32
	block_size: 256

	vocab_size: 1024
	hidden_size: 160
	intermediate_size: 640
	moe_intermediate_size: 320
	num_hidden_layers: 6
	num_attention_heads: 5
	num_key_value_heads: 1
	head_dim: 32
	hidden_size_per_layer_input: 24
	layer_pattern: ssFssF
	sliding_window: 128

	enable_moe_block: true
	num_experts: 4
	top_k_experts: 2
	expert_interval: 2
	```
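
For orientation, the settings above imply the following number of tokens per optimizer step, assuming each sequence fills the full `block_size` and no gradient accumulation is used (neither is stated in this card).

```python
# Values copied from the training settings above.
batch_size = 32
block_size = 256

# Assumption: every sequence fills the block and there is no
# gradient accumulation, so one step sees batch_size * block_size tokens.
tokens_per_step = batch_size * block_size
print(tokens_per_step)  # 8192
```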

	The final evaluation loss in the reference run was approximately:

	```text
	Final loss: 2.4662
	```

	This loss should not be interpreted as a quality benchmark. The model is very small and includes Gemma4 MoE-specific architectural paths primarily for validation coverage.
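
If the reported value is a mean per-token cross-entropy in nats (an assumption; the card does not state the unit), it can be converted to a perplexity and compared with the loss of a uniform guess over the 1024-token vocabulary:

```python
import math

final_loss = 2.4662  # value reported above

# Assumption: the loss is mean per-token cross-entropy in nats.
perplexity = math.exp(final_loss)

# Loss of a uniform guess over the 1024-token vocabulary, for comparison.
uniform_loss = math.log(1024)

print(f"perplexity ~ {perplexity:.2f}, uniform baseline loss ~ {uniform_loss:.2f}")
```

Under that assumption the perplexity is roughly 11.8. Because it is measured over a 1024-token vocabulary, it is not comparable with perplexities reported for models that use much larger vocabularies.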

	## Example generation

	Example output from the reference checkpoint:

	```text
	Prompt: Once upon

	Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big, big, big house."

	Lily was sad and said, "I'm sorry, Lily. I'm sorry, I'm sorry.
	```

	```text
	Prompt: There was a little

	There was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "Lily, you can't play with your toys. It's not a toy. It's a big toys."

	Lily said, "I want to play with it. It's a big toys. It's a big toys. It's a big toys."

	Lily
	```

	```text
	Prompt: One day

	One day, a little girl named Lily went to the park with her mom. She saw a big, big tree and wanted to play with it. She saw a big, big tree and wanted to play with it. She saw a big tree and wanted to play with it.

	"Hello, little girl!" said her mom. "I want to play with you!"

	The little
	```

	The model can generate TinyStories-like text fragments, but repetition and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.
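
In automated checks against this checkpoint, it can be more useful to measure repetition than to judge fluency. The following is a minimal sketch; the function name `repeated_ngram_fraction` and the whitespace tokenization are illustrative choices, not part of this repository.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


sample = "It's a big toys. It's a big toys. It's a big toys."
print(f"{repeated_ngram_fraction(sample):.2f}")
```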

	## Usage

	The example below assumes the model files are stored under an `hf/` subdirectory of the repository. If they are at the repository root, drop the `subfolder` argument.

	```python
	import torch
	from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

	repo = "shibatch/tinygemma4moe5m"

	# Load the tokenizer and model from the `hf/` subfolder of the repository.
	tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
	model = Gemma4ForCausalLM.from_pretrained(
	    repo,
	    subfolder="hf",
	    torch_dtype=torch.float32,
	)
	model.eval()

	prompt = "Once upon"
	inputs = tokenizer(prompt, return_tensors="pt")

	# Greedy decoding (do_sample=False) keeps the output deterministic.
	with torch.no_grad():
	    output_ids = model.generate(
	        **inputs,
	        max_new_tokens=100,
	        do_sample=False,
	        pad_token_id=tokenizer.pad_token_id,
	        eos_token_id=tokenizer.eos_token_id,
	    )

	print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
	```

	## Loading requirements

	This checkpoint requires a Transformers version that supports Gemma4 and Gemma4 MoE.

	The following imports should work:

	```python
	from transformers import Gemma4ForCausalLM, Gemma4TextConfig
	```

	If these imports fail, update Transformers to a version with Gemma4 support.
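
A guarded check along these lines can be used in a test setup to skip Gemma4 tests on older installations. The helper name `has_gemma4_support` is hypothetical; it relies only on the two class names mentioned above.

```python
import importlib


def has_gemma4_support() -> bool:
    """Return True if the installed Transformers exposes the Gemma4 classes."""
    try:
        transformers = importlib.import_module("transformers")
        # Attribute access resolves each class; this may trigger an import
        # inside Transformers, so any failure is treated as "unsupported".
        for name in ("Gemma4ForCausalLM", "Gemma4TextConfig"):
            getattr(transformers, name)
    except Exception:
        return False
    return True


if not has_gemma4_support():
    print("Gemma4 classes not found; update Transformers.")
```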

	## Tokenizer note

	This repository uses a custom byte-level BPE tokenizer saved as a `PreTrainedTokenizerFast`.

	For this reason, examples use:

	```python
	from transformers import PreTrainedTokenizerFast
	```

	instead of `AutoTokenizer`.

	Using `AutoTokenizer` may fail in some environments if the tokenizer backend cannot be inferred automatically.

	The expected tokenizer files include:

	```text
	tokenizer.json
	tokenizer_config.json
	special_tokens_map.json
	```

	## Intended validation coverage

	This checkpoint is intended to validate support for:

	```text
	Gemma4TextConfig
	Gemma4ForCausalLM
	sliding_attention layers
	full_attention layers
	GQA with num_key_value_heads = 1
	global key/value head configuration
	per-layer input embeddings
	tied word embeddings
	Gemma4 RMSNorm behavior
	Gemma4 MLP activation: gelu_pytorch_tanh
	Gemma4 MoE expert parameters
	num_experts = 4
	top_k_experts = 2
	expert_interval = 2
	MoE expert dispatch
	MoE expert output combination
	generate()
	save_pretrained()
	from_pretrained()
	```
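
The GQA entry can be made concrete with the configuration values from this card. The sketch below derives projection widths using standard grouped-query attention arithmetic; the variable names are illustrative, and the actual Gemma4 parameter shapes in Transformers should be checked against the loaded state dict.

```python
# Values copied from the representative configuration above.
hidden_size = 160
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32

# Standard GQA arithmetic (assumed to apply here).
q_width = num_attention_heads * head_dim    # query projection output width
kv_width = num_key_value_heads * head_dim   # key/value projection output width
queries_per_kv = num_attention_heads // num_key_value_heads

print(q_width, kv_width, queries_per_kv)  # 160 32 5
```

With a single key/value head, all five query heads share the same keys and values, which is the most extreme grouping GQA allows.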

	## Limitations

	This is a tiny debug model. It should not be used as a general-purpose language model.

	Known limitations:

	* frequent phrase repetition
	* TinyStories template collapse
	* weak long-form coherence
	* small vocabulary
	* weak semantic consistency
	* no instruction tuning
	* no chat formatting
	* no multimodal capability
	* no OCR capability
	* not suitable for production use

	The checkpoint is primarily intended to make Gemma4 MoE text-model code paths easy to test without downloading a large model.

	## Why include MoE?

	A dense tiny Gemma4 model is simpler to train, but it does not cover MoE-specific implementation paths.

	This checkpoint intentionally includes:

	```text
	num_experts = 4
	top_k_experts = 2
	```

	to exercise a more realistic MoE routing path than a minimal `top_k_experts=1` configuration.

	## Why not full attention only?

	A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior.

	This checkpoint uses:

	```text
	sliding_attention
	sliding_attention
	full_attention
	sliding_attention
	sliding_attention
	full_attention
	```

	to cover both attention implementations.
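
The difference between the two attention types can be sketched as boolean masks. This is a generic illustration, not the Transformers implementation: the function name `attention_mask` is hypothetical, and the window boundary convention (whether a window of size `w` includes the current token) is an assumption.

```python
def attention_mask(seq_len, sliding_window=None):
    """Return mask[q][k] == True where query q may attend to key k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            allowed = k <= q  # causal: no attention to future tokens
            if sliding_window is not None:
                # Assumed convention: the window covers the current token
                # and the sliding_window - 1 tokens before it.
                allowed = allowed and (q - k) < sliding_window
            row.append(allowed)
        mask.append(row)
    return mask


full = attention_mask(6)
sliding = attention_mask(6, sliding_window=3)
# Count the allowed (query, key) pairs in each mask.
print(sum(map(sum, full)), sum(map(sum, sliding)))
```

Under this convention, the two masks are identical whenever the sequence is no longer than the window. With this checkpoint's `sliding_window: 128`, tests that target the sliding path therefore need sequences longer than 128 tokens.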

	## Notes on OCR and multimodal use

	This repository is text-only. It does not include a vision tower, image projector, image-token alignment, or OCR training.

	A Gemma4 OCR or Gemma4 MoE OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

	## Suggested repository name

	Suggested Hugging Face repository name:

	```text
	shibatch/tinygemma4moe5m
	```

	## Citation

	This is a synthetic tiny validation checkpoint derived from Gemma4-compatible MoE text architecture settings. It is intended for debugging and implementation testing.