---
license: mit
tags:
- gemma4
- gemma4-text
- causal-lm
- tinystories
- tiny-model
- validation
- debug-model
- transformers
---

# Tiny Gemma4 Text 3M

This repository contains a tiny Gemma4 text-only causal language model for validation and debugging.

The model is intentionally small and is not intended to be a high-quality text generation model. Its main purpose is to provide a compact checkpoint that exercises the Gemma4 text stack in Hugging Face Transformers, including sliding attention, full attention, grouped-query attention, and per-layer input embeddings.

## Model purpose

This model is designed for:

* testing `Gemma4ForCausalLM`
* validating `Gemma4TextConfig`
* checking model load/save behavior
* testing tokenizer load/save behavior
* exercising both sliding and full attention layers
* exercising grouped-query attention
* exercising Gemma4 per-layer input embedding paths
* providing a small Gemma4 checkpoint for inference-engine validation

It is not designed for:

* high-quality story generation
* benchmark comparison against production language models
* instruction following
* general OCR
* multimodal inference
* chat use

## Model architecture

The model uses `Gemma4ForCausalLM` with a small `Gemma4TextConfig`.

```text
model_type: gemma4_text
vocab_size: 1024
vocab_size_per_layer_input: 1024
hidden_size: 160
hidden_size_per_layer_input: 24
intermediate_size: 640
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
num_global_key_value_heads: 1
head_dim: 32
global_head_dim: 32
sliding_window: 128
max_position_embeddings: 1024
layer_types:
- sliding_attention
- sliding_attention
- full_attention
- sliding_attention
- sliding_attention
- full_attention
hidden_activation: gelu_pytorch_tanh
tie_word_embeddings: true
attention_bias: false
rms_norm_eps: 1e-06
enable_moe_block: false
use_double_wide_mlp: false
pad_token_id: 2
bos_token_id: 0
eos_token_id: 1
```
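
As a rough illustration of what these settings imply for attention shapes, the sketch below derives the projection widths and the query-to-key/value head mapping from the config values. The variable names are illustrative and are not part of the Transformers API. It assumes the usual grouped-query convention in which consecutive query heads share one key/value head.

```python
# Values copied from the config listing above.
hidden_size = 160
num_attention_heads = 5
num_key_value_heads = 1
head_dim = 32

# Output width of the query projection and of each key/value projection.
q_width = num_attention_heads * head_dim
kv_width = num_key_value_heads * head_dim

# Number of query heads that share each key/value head.
group_size = num_attention_heads // num_key_value_heads

# Assumed mapping: query head h reads key/value head h // group_size.
kv_head_for_query = [h // group_size for h in range(num_attention_heads)]

print(q_width, kv_width, group_size, kv_head_for_query)
```

With a single key/value head, all five query heads read the same keys and values, which is the multi-query special case of grouped-query attention.
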

The attention pattern is:

```text
ssFssF
```

where `s` means `sliding_attention` and `F` means `full_attention`.

This pattern was chosen for validation coverage. A full-attention-only model would be easier to train, but it would not exercise the sliding attention path. This model intentionally includes both attention types.
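
The compact pattern string and the `layer_types` list in the config carry the same information. A minimal sketch of the conversion follows; `expand_layer_pattern` is a hypothetical helper written for this README, not a Transformers function.

```python
# Map the one-letter codes used in this README to layer type names.
CODE_TO_LAYER_TYPE = {"s": "sliding_attention", "F": "full_attention"}


def expand_layer_pattern(pattern):
    """Turn a compact pattern such as 'ssFssF' into a list of layer type names."""
    return [CODE_TO_LAYER_TYPE[code] for code in pattern]


layer_types = expand_layer_pattern("ssFssF")
print(layer_types)
```

The resulting list has one entry per decoder layer, so its length should equal `num_hidden_layers`.
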

## Parameter count

```text
total parameters: 2,597,624
trainable parameters: 2,597,624
```

Top-level breakdown:

```text
model: 2,597,624
lm_head: 163,840
```

Because `tie_word_embeddings` is true, `lm_head` shares its weight with `model.embed_tokens`. Its 163,840 parameters are already counted under `model` and are not added to the total again.

Prefix breakdown:

```text
model.embed_tokens: 163,840
model.embed_tokens_per_layer: 147,456
model.layers.0: 377,184
model.layers.1: 377,184
model.layers.2: 377,184
model.layers.3: 377,184
model.layers.4: 377,184
model.layers.5: 377,184
model.norm: 160
model.per_layer_model_projection: 23,040
model.per_layer_projection_norm: 24
```
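
The breakdown can be cross-checked with simple arithmetic. The sketch below recomputes the non-layer entries from the config values and adds the reported per-layer count. It assumes the per-layer embedding table and the per-layer projection are `num_hidden_layers * hidden_size_per_layer_input` wide, an assumption that reproduces the reported figures. The per-layer count itself is taken from the listing, not derived.

```python
# Config values from the architecture section.
vocab_size = 1024
vocab_size_per_layer_input = 1024
hidden_size = 160
hidden_size_per_layer_input = 24
num_hidden_layers = 6

# Assumed packed width of the per-layer inputs across all layers.
per_layer_width = num_hidden_layers * hidden_size_per_layer_input

expected = {
    "model.embed_tokens": vocab_size * hidden_size,
    "model.embed_tokens_per_layer": vocab_size_per_layer_input * per_layer_width,
    "model.per_layer_model_projection": hidden_size * per_layer_width,
    "model.per_layer_projection_norm": hidden_size_per_layer_input,
    "model.norm": hidden_size,
}

# Per-decoder-layer count as reported in the breakdown (not derived here).
reported_per_layer = 377_184

total = sum(expected.values()) + num_hidden_layers * reported_per_layer

for name, count in expected.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```
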

## Training data

The model was trained on TinyStories-style English story text.

The tokenizer is a small byte-level BPE tokenizer with a vocabulary size of 1024. This small vocabulary is intentional: it keeps the checkpoint compact and reduces embedding size, but it also limits text generation quality.

## Training setup

The model was trained as a compact text-only Gemma4 validation model.

Representative training settings:

```text
num_epochs: 1
learning_rate: 2e-4
batch_size: 32
block_size: 256
vocab_size: 1024
hidden_size: 160
intermediate_size: 640
num_hidden_layers: 6
num_attention_heads: 5
num_key_value_heads: 1
head_dim: 32
hidden_size_per_layer_input: 24
layer_pattern: ssFssF
sliding_window: 128
```

The final training loss in the reference run was approximately:

```text
Final loss: 3.1163
```

This value should not be interpreted as a quality benchmark. The model is very small and includes Gemma4-specific architectural paths primarily for validation coverage.
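
For orientation only, a cross-entropy loss can be converted to a per-token perplexity by exponentiation and compared with the loss of a uniform guess over the vocabulary. The sketch assumes the reported loss is a mean cross-entropy in nats per token, which this README does not state.

```python
import math

final_loss = 3.1163  # reported final training loss
vocab_size = 1024

# Perplexity implied by the loss, assuming nats per token.
perplexity = math.exp(final_loss)

# Loss of a model that predicts every token uniformly, for comparison.
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.1f}")
print(f"uniform-guess loss: {uniform_loss:.3f}")
```
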

## Example generation

Example output from the reference checkpoint:

```text
Prompt: Once upon
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily's mom said, "We can play with the toys, but you can't play with it. You can play with it."
```

The model can generate TinyStories-like text fragments, but repetitions and template collapse are expected. This is normal for this checkpoint and is not considered a failure for its intended purpose.

## Usage

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma4ForCausalLM

repo = "shibatch/tinygemma4text3m"

# Both the tokenizer and the model are loaded from the "hf" subfolder.
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma4ForCausalLM.from_pretrained(
    repo,
    subfolder="hf",
    torch_dtype=torch.float32,
)
model.eval()

prompt = "Once upon"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Loading with Transformers

This checkpoint requires a Transformers version that supports Gemma4.

```python
from transformers import Gemma4ForCausalLM, Gemma4TextConfig
```

If this import fails, update Transformers to a version with Gemma4 support.
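
In a test suite it can be convenient to detect missing Gemma4 support up front, not partway through a run. The following is a minimal sketch of such a guard. The `HAS_GEMMA4` flag is a name chosen for this example, and the README does not state which Transformers release first includes Gemma4, so no minimum version is checked.

```python
import importlib.metadata

try:
    from transformers import Gemma4ForCausalLM, Gemma4TextConfig
    HAS_GEMMA4 = True
except ImportError:
    # Either Transformers is not installed or this version lacks Gemma4.
    HAS_GEMMA4 = False

# Report the installed version, if any, to help decide whether to upgrade.
try:
    transformers_version = importlib.metadata.version("transformers")
except importlib.metadata.PackageNotFoundError:
    transformers_version = None

print("Gemma4 available:", HAS_GEMMA4, "| transformers:", transformers_version)
```
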

## Intended validation coverage

This model is useful for checking that an implementation supports:

```text
Gemma4TextConfig
Gemma4ForCausalLM
sliding_attention layers
full_attention layers
GQA with num_key_value_heads = 1
global key/value head configuration
per-layer input embeddings
tied word embeddings
Gemma4 RMSNorm behavior
Gemma4 MLP activation: gelu_pytorch_tanh
generate()
save_pretrained()
from_pretrained()
```

## Limitations

This is a tiny debug model. It should not be used as a general-purpose language model.

Known limitations:

* frequent phrase repetition
* weak long-form coherence
* frequent TinyStories template collapse
* small vocabulary
* weak semantic consistency
* no instruction tuning
* no chat formatting
* no multimodal capability
* no OCR capability

The checkpoint is primarily intended to make Gemma4 text-model code paths easy to test without downloading a large model.

## Why not full attention only?

A full-attention-only tiny model may train more cleanly, but it would not cover Gemma4 sliding attention behavior. Since this checkpoint is intended for implementation validation, it uses a mixed attention pattern:

```text
sliding_attention
sliding_attention
full_attention
sliding_attention
sliding_attention
full_attention
```

This provides better code-path coverage than `FFFFFF`.
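
To make the difference between the two layer types concrete, the sketch below builds both masks in plain Python. `causal_mask` is a hypothetical helper written for this README, not the Transformers implementation, and it assumes a window of size `w` lets each query see itself plus the previous `w - 1` positions. Engines may use a different off-by-one convention.

```python
def causal_mask(seq_len, window=None):
    """Return mask[i][j] == True when query position i may attend to key position j.

    window=None gives full causal attention. Otherwise each query sees at most
    `window` positions, counting itself (an assumed convention).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            allowed = j <= i
            if window is not None:
                allowed = allowed and (i - j) < window
            row.append(allowed)
        mask.append(row)
    return mask


# Small sizes for readability; the checkpoint uses sliding_window = 128.
full = causal_mask(6)
sliding = causal_mask(6, window=3)

for row in sliding:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this convention the two masks are identical whenever the sequence is no longer than the window. A validation prompt therefore needs more than 128 tokens before the sliding layers of this checkpoint behave differently from the full layers. The training `block_size` of 256 is long enough for that.
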

## Notes on OCR and multimodal use

This repository is text-only. It does not include a vision tower, image projector, image token alignment, or OCR training.

A Gemma4 OCR validation model would be a separate project. It would require a tiny multimodal Gemma4 configuration, a synthetic OCR dataset, image-token handling, vision/text alignment, OCR fine-tuning, and additional validation scripts.

## Citation

This is a synthetic tiny validation checkpoint derived from Gemma4-compatible architecture settings. It is intended for debugging and implementation testing.