---
license: mit
base_model: google/gemma-2b
tags:
- gemma2
- gqa
- gguf
- safetensors
- transformers
- validation
- test-suite
---

# TinyStories Gemma 2 1M GQA (tinygemma1m) GGUF & HF Validation Suite

This repository provides an ultra-lightweight Gemma 2 model variant that combines a **custom BPE tokenizer** with a strict **GQA (Grouped-Query Attention)** layout. It is trained on the TinyStories dataset and scaled down to roughly **1M parameters** so that it can serve as a narrowly targeted validation testbed.

It is intended specifically for debugging custom inference engines and runtime tensor compilers against Gemma 2's specialized operators.

---

## 📊 Comparison: `tinygemma1m` vs Other 1M Variants

To track which runtime features each of the 1M-parameter test models covers, the architectural differences are summarized below:

| Feature / Metric | `tiny1m` (Standard) | `tinybpe1m` (BPE Variant) | `tinymqa1m` (MQA Variant) | `tinygemma1m` (This Repository) |
| :--- | :--- | :--- | :--- | :--- |
| **Base Architecture** | Llama 2 | Llama 2 | Llama 2 | **Gemma 2** |
| **Attention Mechanism** | MHA (Multi-Head) | MHA (Multi-Head) | MQA (Multi-Query) | **GQA (Grouped-Query)** |
| **Attention Heads ($N_{heads} / N_{kv\_heads}$)** | 2 Heads / 2 KV | 2 Heads / 2 KV | 4 Heads / 1 KV | **2 Heads / 1 KV Head** (2:1 Ratio) |
| **Activation Function** | SwiGLU | SwiGLU | SwiGLU | **GeGLU** |
| **RMSNorm Placement** | Pre-layer norm only | Pre-layer norm only | Pre-layer norm only | **Pre- & Post-layer norm** (Double) |
| **Specialized Quirks** | None | None | None | **Embedding scaling ($\sqrt{d}$), Soft-Capping** |
| **Tokenizer Type** | Character-level | SentencePiece BPE | SentencePiece BPE | **SentencePiece BPE** |
| **Primary Debug Target** | Core matrix mult & layout | `byte_fallback` decode | KV-cache alignment | **Gemma 2 advanced execution graph** |
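
Two rows of this table reduce to one-line formulas that are easy to get wrong in a new engine. The sketch below is a minimal illustration, assuming the common convention that consecutive query heads share one KV head and that embeddings are multiplied by $\sqrt{d}$ with $d$ equal to the hidden size; the function names are hypothetical and not part of this repository.

```python
import math

HIDDEN_SIZE = 128  # hidden size listed for this model
N_HEADS = 2        # query heads
N_KV_HEADS = 1     # key/value heads


def kv_head_for(query_head: int, n_heads: int = N_HEADS,
                n_kv_heads: int = N_KV_HEADS) -> int:
    """Return the KV head index that a given query head reads from under GQA.

    Assumes consecutive query heads form one group (hypothetical helper).
    """
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads
    return query_head // group_size


def scale_embedding(vector, hidden_size: int = HIDDEN_SIZE):
    """Multiply an embedding vector by sqrt(hidden_size)."""
    scale = math.sqrt(hidden_size)
    return [x * scale for x in vector]


if __name__ == "__main__":
    # With 2 query heads and 1 KV head, both query heads map to KV head 0.
    print([kv_head_for(h) for h in range(N_HEADS)])
    print(scale_embedding([1.0, -0.5]))
```

With the 2:1 ratio used here, every query head reads the same cached K/V tensors, so an engine that indexes the cache by query head instead of KV head will read out of bounds or produce wrong attention scores.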

### 💡 Why validate with `tinygemma1m`?
Compared to standard architectures like Llama 2, Gemma 2 adds several compute-graph features that commonly cause execution bugs. **Dual RMSNorm boundaries** (a norm on both the layer input and the block output), **3-tensor GeGLU projections**, **attention and final-logit soft-capping**, and **GQA cache broadcasting** are all error-prone during clean-room engine development.

This model exercises all of these kernels inside a lightweight 1M-parameter footprint, making it easy to isolate math errors without the memory overhead or slow iteration of full production weights.
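
As a rough reference for two of the operators named above, the following sketch shows soft-capping as `cap * tanh(x / cap)` and the element-wise gating step of GeGLU. It assumes the tanh approximation of GELU; the function names are hypothetical, and the projection matrices (gate, up, down) are left out so that only the non-linear parts are shown.

```python
import math


def soft_cap(x: float, cap: float) -> float:
    """Squash a logit smoothly into the open interval (-cap, cap)."""
    return cap * math.tanh(x / cap)


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (assumed variant for this sketch)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def geglu_gate(gate, up):
    """Element-wise GeGLU gating: GELU(gate) * up.

    `gate` and `up` stand for the outputs of the gate and up projections;
    the result would then be fed to the down projection.
    """
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


if __name__ == "__main__":
    # Small logits pass through almost unchanged; large ones saturate near cap.
    for logit in (1.0, 40.0, 400.0):
        print(logit, soft_cap(logit, 50.0), soft_cap(logit, 30.0))
    print(geglu_gate([0.0, 1.0], [2.0, 3.0]))
```

A quick sanity check for an engine is that capped attention scores never exceed 50.0 in magnitude and capped final logits never exceed 30.0, the thresholds listed for this model.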

---

## 📂 Repository Structure & File Descriptions

### 1. GGUF Formats (Root Directory `./`)
A set of GGUF binaries built for `llama.cpp` and compatible runtimes. To work around hardcoded string handling in upstream parsers, these files have been binary-patched so that the text-mapping parameters and prefix logic are restored correctly:

| Filename | Type | Size | Purpose / Validation Target |
| :--- | :--- | :--- | :--- |
| **`tinygemma1m.F32.gguf`** | `F32` | ~4.0 MB | **Baseline Test.** Validates raw Gemma 2 execution graph topology, matrix dimensions, and RoPE indexing without quantization noise. |
| **`tinygemma1m.F16.gguf`**<br>**`tinygemma1m.BF16.gguf`** | `F16`<br>`BF16` | ~2.0 MB | **Half-Precision Test.** Validates 16-bit float parsing, tensor execution boundaries, and compilation stability. |
| **`tinygemma1m.Q8_0.gguf`** | `Q8_0` | ~1.1 MB | **Uniform Quantization.** Validates block-based uniform scaling over 32-element blocks under Gemma 2 dimensions. |
| **`tinygemma1m.Q4_0.gguf`**<br>**`tinygemma1m.Q4_1.gguf`** | `Q4_0`<br>`Q4_1` | ~0.7 MB | **Classic Quantization.** Validates classic 4-bit linear quantization schemes and un-packing layouts. |
| **`tinygemma1m.Q2_K.gguf`** | `Q2_K` | ~0.5 MB | **Standard K-Quant (2-bit).** Validates extreme 2-bit super-block dequantization loops. |
| **`tinygemma1m.Q3_K_M.gguf`** | `Q3_K_M` | ~0.6 MB | **Standard K-Quant (3-bit).** Validates medium sub-variant of 3-bit multi-block structures. |
| **`tinygemma1m.Q4_K_M.gguf`** | `Q4_K_M` | ~0.7 MB | **Standard K-Quant (4-bit).** Validates medium sub-variant of modern 4-bit super-block structures. |
| **`tinygemma1m.Q5_K_M.gguf`** | `Q5_K_M` | ~0.8 MB | **Standard K-Quant (5-bit).** Validates medium sub-variant of mixed 5-bit precision layouts. |
| **`tinygemma1m.Q6_K.gguf`** | `Q6_K` | ~0.9 MB | **Standard K-Quant (6-bit).** Validates high-fidelity 6-bit super-block implementations. |
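
To make the `Q8_0` row concrete, here is a minimal sketch of block-based uniform quantization as it is commonly described for ggml: each block of 32 weights shares one scale, and each weight is stored as a signed 8-bit integer. The function names are hypothetical, the scale is kept as a Python float (the on-disk format stores it in half precision), and Python's `round` may differ from a C runtime on exact ties.

```python
BLOCK_SIZE = 32  # number of weights that share one scale in Q8_0


def quantize_q8_0_block(values):
    """Quantize one block of 32 floats to (scale, list of int8-range values)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("a Q8_0 block holds exactly 32 values")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        # All-zero block: any scale works, so store zeros.
        return 0.0, [0] * BLOCK_SIZE
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


if __name__ == "__main__":
    block = [(i - 16) / 4.0 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_q8_0_block(block)
    restored = dequantize_q8_0_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("scale:", scale, "max abs error:", worst)
```

The round-trip error per weight is bounded by half the block scale, which gives a simple tolerance to use when comparing a `Q8_0` run of an engine against the `F32` baseline.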

### 2. Hugging Face Native Format (`./hf/`)
Unquantized weights and configuration files for the PyTorch `transformers` library:
* **`hf/model.safetensors`**: Unquantized weight tensors in the Gemma 2 layer layout.
* **`hf/config.json`**: Structural settings matching `Gemma2Config` properties (layer counts, soft-capping thresholds, head allocation ratios).
* **`hf/generation_config.json`**: Default sampling settings.
* **`hf/tokenizer.model`**: The custom SentencePiece BPE model file (vocabulary size 512).
* **`hf/tokenizer_config.json`**: Tokenizer metadata that selects `LlamaTokenizer` and configures automatic `<s>` (BOS) injection on the PyTorch backend.
* **`hf/special_tokens_map.json`**: Maps the special token strings (`<s>` = 1, `</s>` = 2) to their token IDs.

---

## 🚀 Usage Examples

### A. Running GGUF via llama.cpp
To verify your local runtime or inspect token generation with the Gemma 2 architecture:
```bash
./llama-cli -m tinygemma1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
```

### B. Loading Hugging Face Formats via Python

Because the configuration files match the tokenizer vocabulary, you can load both components directly with the `Auto*` classes, without manual wrapper logic.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tinygemma1m"

print("Loading tokenizer and model configuration...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Tom and Jerry are "
# Text tokenization and automatic <s> (BOS) injection are managed via config metadata
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Executing inference loop (Validating Gemma 2 projection tensors)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=False
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)
```

---

## πŸ“ Model Specifications

* **Architecture:** Gemma 2 (`Gemma2ForCausalLM`)
* **Dataset:** TinyStories
* **Total Parameters:** ~1M
* **Vocabulary Size (`vocab_size`):** 512 (Custom SentencePiece BPE with `byte_fallback` enabled)
* **Hidden Size (`hidden_size`):** 128
* **Number of Hidden Layers (`num_hidden_layers`):** 3
* **Number of Attention Heads (`num_attention_heads`):** 2 *(head_dim = 64)*
* **Number of Key-Value Heads (`num_key_value_heads`):** 1 *(GQA Ratio = 2:1)*
* **Intermediate Size (`intermediate_size`):** 352
* **Max Position Embeddings (`max_position_embeddings`):** 256
* **Sliding Window Size:** 256
* **Logit Soft-Capping Thresholds:** Attention=50.0, Final=30.0
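
The figures above fix the tensor shapes an engine should expect. The sketch below derives a few of them as a cross-check, assuming one K and one V vector per KV head, per layer, per token in the cache; the dictionary keys and function names are hypothetical shorthand, not the exact `config.json` field names.

```python
# Values copied from the specification list above.
SPEC = {
    "hidden_size": 128,
    "n_layers": 3,
    "n_heads": 2,
    "n_kv_heads": 1,
    "head_dim": 64,
    "max_positions": 256,
}


def projection_widths(spec):
    """Return (query width, key/value width) of the attention projections."""
    q_width = spec["n_heads"] * spec["head_dim"]
    kv_width = spec["n_kv_heads"] * spec["head_dim"]
    return q_width, kv_width


def kv_cache_elements(spec, n_tokens):
    """Count cached K and V scalars for n_tokens across all layers."""
    per_token = 2 * spec["n_layers"] * spec["n_kv_heads"] * spec["head_dim"]
    return per_token * n_tokens


if __name__ == "__main__":
    q_width, kv_width = projection_widths(SPEC)
    print("Q width:", q_width, "K/V width:", kv_width)
    print("GQA ratio:", SPEC["n_heads"] // SPEC["n_kv_heads"])
    full = kv_cache_elements(SPEC, SPEC["max_positions"])
    print("KV cache elements at full context:", full)
    print("KV cache bytes at F32:", full * 4)
```

Because the K/V width is half the Q width, an engine that allocates its cache using the query head count will over-allocate by the GQA ratio, which is harmless for memory at this size but usually signals an indexing bug elsewhere.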

## 📜 Acknowledgments & License

* **Original Implementation:** Heavily inspired by the `llama2.c` project.
* **Dataset:** TinyStories dataset.
* **License:** **MIT License**. You are free to copy, modify, distribute, and use these assets for commercial or educational purposes.