	---
	license: mit
	tags:
	- gemma4
	- gemma4-text
	- causal-lm
	- tinystories
	- tiny-model
	- validation
	- debug-model
	- transformers
	---

	# Tiny Gemma4 Text 3M

	This repository contains a tiny Gemma4 text-only causal language model for validation and debugging.

	The model is intentionally small. It is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises the Gemma4 text stack in Hugging Face Transformers, including sliding attention, full attention, grouped-query attention, and per-layer input embeddings.

	## Model purpose

	This model is designed for:

	* testing `Gemma4ForCausalLM`
	* validating `Gemma4TextConfig`
	* checking model load/save behavior
	* testing tokenizer load/save behavior
	* exercising both sliding and full attention layers
	* exercising grouped-query attention
	* exercising Gemma4 per-layer input embedding paths
	* providing a small Gemma4 checkpoint for inference-engine validation

	It is not designed for:

	* high-quality story generation
	* benchmark comparison against production language models
	* instruction following
	* general OCR
	* multimodal inference
	* chat use

	## Model architecture

	The model uses `Gemma4ForCausalLM` with a small `Gemma4TextConfig`.

	```text
	model_type: gemma4_text
	vocab_size: 1024
	vocab_size_per_layer_input: 1024
	hidden_size: 160
	hidden_size_per_layer_input: 24
	intermediate_size: 640
	num_hidden_layers: 6
	num_attention_heads: 5
	num_key_value_heads: 1
	num_global_key_value_heads: 1
	head_dim: 32
	global_head_dim: 32
	sliding_window: 128
	max_position_embeddings: 1024
	layer_types:
	- sliding_attention
	- sliding_attention
	- full_attention
	- sliding_attention
	- sliding_attention
	- full_attention
	hidden_activation: gelu_pytorch_tanh
	tie_word_embeddings: true
	attention_bias: false
	rms_norm_eps: 1e-06
	enable_moe_block: false
	use_double_wide_mlp: false
	pad_token_id: 2
	bos_token_id: 0
	eos_token_id: 1
	```
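As a rough illustration of what these values imply for grouped-query attention, the sketch below derives the projection widths from the config. It assumes the conventional GQA layout (query width = heads × head_dim, key/value width = KV heads × head_dim). The variable names are illustrative and are not taken from the Transformers source.

```python
# Values copied from the config listing above.
hidden_size = 160
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32

# Assumption: conventional grouped-query attention layout.
q_width = num_attention_heads * head_dim    # total width of all query heads
kv_width = num_key_value_heads * head_dim   # total width of all key (or value) heads
queries_per_kv_head = num_attention_heads // num_key_value_heads

print(f"query width: {q_width}")
print(f"key/value width: {kv_width}")
print(f"query heads sharing each KV head: {queries_per_kv_head}")
```

With a single key/value head, all five query heads share the same keys and values, which is the most extreme grouping this head count allows.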

	The attention pattern is:

	```text
	ssFssF
	```

	where `s` means `sliding_attention` and `F` means `full_attention`.
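The shorthand can be expanded back into the `layer_types` list mechanically. The helper below is a hypothetical convenience function for test scripts, not part of Transformers.

```python
def expand_layer_pattern(pattern: str) -> list[str]:
    """Expand a shorthand such as 'ssFssF' into a list of layer type names.

    Hypothetical helper: 's' maps to sliding_attention, 'F' to full_attention.
    """
    names = {"s": "sliding_attention", "F": "full_attention"}
    return [names[ch] for ch in pattern]


layer_types = expand_layer_pattern("ssFssF")
for index, layer_type in enumerate(layer_types):
    print(index, layer_type)
```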

	This pattern was chosen for validation coverage. A full-attention-only model would be easier to train, but it would not exercise the sliding attention path. This model intentionally includes both attention types.

	## Parameter count

	```text
	total parameters: 2,597,624
	trainable parameters: 2,597,624
	```

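Top-level breakdown (`lm_head` is tied to `model.embed_tokens` because `tie_word_embeddings` is true, so its weights are already included in the `model` count):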

	```text
	model: 2,597,624
	lm_head: 163,840
	```

	Prefix breakdown:

	```text
	model.embed_tokens: 163,840
	model.embed_tokens_per_layer: 147,456
	model.layers.0: 377,184
	model.layers.1: 377,184
	model.layers.2: 377,184
	model.layers.3: 377,184
	model.layers.4: 377,184
	model.layers.5: 377,184
	model.norm: 160
	model.per_layer_model_projection: 23,040
	model.per_layer_projection_norm: 24
	```
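The prefix breakdown can be checked with simple arithmetic. The sketch below sums the listed counts and compares three of them with sizes derived from the config. The derived shapes are assumptions that happen to reproduce the listed numbers; they are not read from the checkpoint.

```python
# Config values from the architecture section.
vocab_size = 1024
vocab_size_per_layer_input = 1024
hidden_size = 160
hidden_size_per_layer_input = 24
num_hidden_layers = 6

# Counts copied from the prefix breakdown above.
prefix_counts = {
    "model.embed_tokens": 163_840,
    "model.embed_tokens_per_layer": 147_456,
    **{f"model.layers.{i}": 377_184 for i in range(num_hidden_layers)},
    "model.norm": 160,
    "model.per_layer_model_projection": 23_040,
    "model.per_layer_projection_norm": 24,
}
total = sum(prefix_counts.values())

# Assumed shapes: one per-layer embedding slice of width 24 for each of the 6 layers.
per_layer_width = num_hidden_layers * hidden_size_per_layer_input
embed_tokens_guess = vocab_size * hidden_size
embed_per_layer_guess = vocab_size_per_layer_input * per_layer_width
projection_guess = hidden_size * per_layer_width

print(f"sum of prefixes: {total:,}")
print(f"embed_tokens (derived): {embed_tokens_guess:,}")
print(f"embed_tokens_per_layer (derived): {embed_per_layer_guess:,}")
print(f"per_layer_model_projection (derived): {projection_guess:,}")
```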

	## Training data

	The model was trained on TinyStories-style English story text.

	The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. This small vocabulary is intentional: it keeps the checkpoint compact and reduces embedding size, but it also limits text generation quality.

	## Training setup

	The model was trained as a compact text-only Gemma4 validation model.

	Representative training settings:

	```text
	num_epochs: 1
	learning_rate: 2e-4
	batch_size: 32
	block_size: 256
	vocab_size: 1024
	hidden_size: 160
	intermediate_size: 640
	num_hidden_layers: 6
	num_attention_heads: 5
	num_key_value_heads: 1
	head_dim: 32
	hidden_size_per_layer_input: 24
	layer_pattern: ssFssF
	sliding_window: 128
	```

	The final training loss in the reference run was approximately:

	```text
	Final loss: 3.1163
	```

	This value should not be interpreted as a quality benchmark. The model is very small and includes Gemma4-specific architectural paths primarily for validation coverage.
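For a sanity check rather than a quality claim, the loss can be compared with the loss of uniform guessing over the vocabulary. The sketch below assumes the reported value is a mean per-token cross-entropy in nats, which is the usual convention but is not stated for the reference run.

```python
import math

final_loss = 3.1163   # reported final training loss
vocab_size = 1024

# Assumption: the loss is mean cross-entropy per token, measured in nats.
perplexity = math.exp(final_loss)
uniform_loss = math.log(vocab_size)  # loss of guessing every token uniformly

print(f"perplexity: {perplexity:.2f}")
print(f"uniform-guess loss: {uniform_loss:.3f}")
```

Under that assumption, a loss well below the uniform-guess value indicates that training ran and the forward and backward paths are connected, which is all this checkpoint needs to demonstrate.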

	## Example generation

	Example output from the reference checkpoint:

	```text
	Prompt: Once upon

	Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "We can play with the toys, but you can't play with it. You can play with it."
	```

	The model can generate TinyStories-like text fragments, but repetitions and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.

	## Usage

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4text3m"

# The tokenizer and weights are loaded from the "hf" subfolder of the repository.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```


	## Loading with Transformers

	This checkpoint requires a Transformers version that supports Gemma4.

	```python
	from transformers import Gemma4ForCausalLM, Gemma4TextConfig
	```

	If this import fails, update Transformers to a version with Gemma4 support.
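A test script can probe for support before attempting to load the checkpoint. The function below is a hypothetical helper, not part of Transformers. It only checks that the installed package exposes the `Gemma4ForCausalLM` name and reports `False` if Transformers is missing or the lookup fails.

```python
import importlib.util


def has_gemma4_support() -> bool:
    """Return True if the installed Transformers exposes Gemma4ForCausalLM."""
    if importlib.util.find_spec("transformers") is None:
        return False  # Transformers is not installed at all.
    try:
        import transformers

        return hasattr(transformers, "Gemma4ForCausalLM")
    except Exception:
        # Treat any import-time failure as "not supported" in this sketch.
        return False


print("Gemma4 available:", has_gemma4_support())
```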

	## Intended validation coverage

	This model is useful for checking that an implementation supports:

	```text
	Gemma4TextConfig
	Gemma4ForCausalLM
	sliding_attention layers
	full_attention layers
	GQA with num_key_value_heads = 1
	global key/value head configuration
	per-layer input embeddings
	tied word embeddings
	Gemma4 RMSNorm behavior
	Gemma4 MLP activation: gelu_pytorch_tanh
	generate()
	save_pretrained()
	from_pretrained()
	```

	## Limitations

	This is a tiny debug model. It should not be used as a general-purpose language model.

	Known limitations:

	* frequent phrase repetition
	* weak long-form coherence
	* frequent TinyStories template collapse
	* small vocabulary
	* weak semantic consistency
	* no instruction tuning
	* no chat formatting
	* no multimodal capability
	* no OCR capability

	The checkpoint is primarily intended to make Gemma4 text-model code paths easy to test without downloading a large model.

	## Why not full attention only?

	A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior. Since this checkpoint is intended for implementation validation, it uses a mixed attention pattern:

	```text
	sliding_attention
	sliding_attention
	full_attention
	sliding_attention
	sliding_attention
	full_attention
	```

	This provides better code-path coverage than `FFFFFF`.

	## Notes on OCR and multimodal use

	This repository is text-only. It does not include a vision tower, image projector, image token alignment, or OCR training.

	A Gemma4 OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

	## Citation

	This is a synthetic tiny validation checkpoint derived from Gemma4-compatible architecture settings. It is intended for debugging and implementation testing.