---
license: mit
base_model: karpathy/tinyllamas
tags:
- llama2
- mqa
- gguf
- safetensors
- transformers
- tinyllamas
- validation
- test-suite
---

# TinyStories Llama2 1M MQA (tinymqa1m) GGUF & HF Validation Suite

This repository provides an ultra-lightweight Llama2 model variant featuring a **Custom BPE Tokenizer** combined with a strict **MQA (Multi-Query Attention)** structural layout. It is trained on the TinyStories dataset and optimized specifically for compiler, runtime, and hardware kernel validation.

---

## 📊 Comparison: `tinymqa1m` vs Previous Variants

To help you choose the right test asset for your engine debugging goals, the differences across the 1M parameter suite are summarized below:

| Feature / Metric | `tiny1m` (Standard) | `tinybpe1m` (BPE Variant) | `tinymqa1m` (This Repository) |
| :--- | :--- | :--- | :--- |
| **Attention Mechanism** | **MHA** (Multi-Head Attention) | **MHA** (Multi-Head Attention) | **MQA** (Multi-Query Attention) |
| **Attention Heads ($N_{heads} / N_{kv\_heads}$)** | 2 Heads / 2 KV Heads | 2 Heads / 2 KV Heads | **4 Heads / 1 KV Head** (Asymmetric) |
| **Tokenizer Type** | Simple Character-level | **SentencePiece BPE** | **SentencePiece BPE** |
| **Byte Fallback Support** | No | **Yes** (`byte_fallback=True`) | **Yes** (`byte_fallback=True`) |
| **`llama2.c` Compatibility** | **Fully Compatible** (`run.c`) | Incompatible (Corrupts text) | **Incompatible** (Crashes/Corrupts) |
| **Primary Debug Target** | Core matrix multiplication & layout | `byte_fallback` decoder loop | **KV-cache alignment & broadcast** |

### Why test with `tinymqa1m`?
Modern architectures such as Llama 3, Gemma, and Mistral rely on GQA (Grouped-Query Attention) or MQA to reduce KV-cache size and memory bandwidth. Implementing these attention patterns in custom inference engines (C/C++, Vulkan, etc.) frequently introduces boundary bugs into KV-cache tensor indexing. This model lets you validate **KV-cache matrix broadcasting logic** at a ~1M parameter scale, with a negligible memory footprint.
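
The broadcast under test can be sketched in a few lines of plain Python. This is a minimal illustration, not code from this repository: the function name `mqa_attention` and the list-of-lists cache layout are assumptions chosen for readability. Each query head `h` reads KV head `h // (n_heads // n_kv_heads)`, which for this model (4 heads, 1 KV head) is always head 0.

```python
import math


def mqa_attention(q_heads, k_cache, v_cache):
    """Single-position attention where query heads share KV heads.

    q_heads: n_heads query vectors, each of length head_dim.
    k_cache, v_cache: n_kv_heads lists, each holding one vector per cached position.
    (Hypothetical layout for illustration only.)
    """
    n_heads = len(q_heads)
    n_kv_heads = len(k_cache)
    group = n_heads // n_kv_heads  # query heads per KV head (4 for this model)
    outputs = []
    for h, q in enumerate(q_heads):
        kv = h // group  # the broadcast: several query heads map to one KV head
        keys, values = k_cache[kv], v_cache[kv]
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        # Numerically stable softmax over cached positions
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(q))]
        )
    return outputs
```

A common failure mode in custom engines is indexing the cache with `h` instead of `kv`, which reads past the single KV head. With MHA (`n_heads == n_kv_heads`) the two indices coincide, so the bug only surfaces on GQA or MQA models like this one.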

---

## 📂 Repository Structure & File Descriptions

### 1. GGUF Formats (Root Directory `./`)
A complete suite compiled for `llama.cpp` and compatible modern custom runtimes. The structural MQA hyper-parameters and specialized token layouts are fully baked into each GGUF binary:

| Filename(s) / Wildcard Pattern | Type | Size | Purpose / Validation Target |
| :--- | :--- | :--- | :--- |
| **`tinymqa1m.F32.gguf`** | `F32` | ~4.0 MB | **Baseline Test.** Validates GGUF parsing, MQA tensor layout, matrix dimensions, and RoPE indexing without dequantization factors. |
| **`tinymqa1m.F16.gguf`**<br>**`tinymqa1m.BF16.gguf`** | `F16`<br>`BF16` | ~2.0 MB | **Half-Precision Test.** Validates 16-bit float loading, tensor broadcasting, and structural inference stability. |
| **`tinymqa1m.Q8_0.gguf`** | `Q8_0` | ~1.1 MB | **Quantization Level 1.** Validates block-based uniform scaling over 32-element blocks under MQA dimensions. |
| **`tinymqa1m.Q4_0.gguf`**<br>**`tinymqa1m.Q4_1.gguf`** | `Q4_0`<br>`Q4_1` | ~0.7 MB | **Quantization Level 2.** Validates classic 4-bit linear quantization and bit-unpacking logic. |
| **`tinymqa1m.Q2_K.gguf`** | `Q2_K` | ~0.5 MB | **Standard K-Quant (2-bit).** Validates 2-bit super-block quantization parsing. |
| **`tinymqa1m.Q3_K_*.gguf`**<br>↳ *`tinymqa1m.Q3_K_S.gguf`*<br>↳ *`tinymqa1m.Q3_K_M.gguf`*<br>↳ *`tinymqa1m.Q3_K_L.gguf`* | `Q3_K` | ~0.6 MB | **Standard K-Quant (3-bit).** Validates Small, Medium, and Large sub-variants of 3-bit multi-block structures. |
| **`tinymqa1m.Q4_K_*.gguf`**<br>↳ *`tinymqa1m.Q4_K_S.gguf`*<br>↳ *`tinymqa1m.Q4_K_M.gguf`* | `Q4_K` | ~0.7 MB | **Standard K-Quant (4-bit).** Validates Small and Medium sub-variants of modern 4-bit super-block structural parsing. |
| **`tinymqa1m.Q5_K_*.gguf`**<br>↳ *`tinymqa1m.Q5_K_S.gguf`*<br>↳ *`tinymqa1m.Q5_K_M.gguf`* | `Q5_K` | ~0.8 MB | **Standard K-Quant (5-bit).** Validates Small and Medium sub-variants of 5-bit mixed precision super-blocks. |
| **`tinymqa1m.Q6_K.gguf`** | `Q6_K` | ~0.9 MB | **Standard K-Quant (6-bit).** Validates 6-bit high-fidelity super-block quantization. |
| **`tinymqa1m.IQ3_*.gguf`**<br>↳ *`tinymqa1m.IQ3_XXS.gguf`*<br>↳ *`tinymqa1m.IQ3_S.gguf`* | `I-Quants` | ~0.5 MB | **Importance Quants (3-bit).** Non-linear 3-bit importance quantization targeting lookup table (codebook) decoding logic. |
| **`tinymqa1m.IQ4_*.gguf`**<br>↳ *`tinymqa1m.IQ4_NL.gguf`*<br>↳ *`tinymqa1m.IQ4_XS.gguf`* | `I-Quants` | ~0.6 MB | **Importance Quants (4-bit).** Non-linear 4-bit importance quantization variants (Non-Linear and Extra Small). |
| **`tinymqa1m.TQ1_0.gguf`**<br>**`tinymqa1m.TQ2_0.gguf`** | `Ternary` | ~0.4 MB | **Experimental.** Ternary (-1, 0, 1) state quantization for cutting-edge engine testing. |
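
To make the "block-based uniform scaling" row concrete, here is a rough Python sketch of a Q8_0-style round trip. It assumes the block layout used by llama.cpp's reference quantizer (one fp16 scale followed by 32 signed 8-bit values, 34 bytes in total); the function names are hypothetical, and the ggml source should be treated as the authority on the exact format.

```python
import math
import struct

QK8_0 = 32  # elements per block (assumed to match llama.cpp's Q8_0)


def quantize_q8_0_block(values):
    """Pack 32 floats into an fp16 scale plus 32 int8 quants (34 bytes)."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    d = amax / 127.0  # uniform scale for the whole block
    inv_d = 1.0 / d if d else 0.0
    # Round half away from zero, mirroring C's roundf()
    quants = [
        int(math.copysign(math.floor(abs(v * inv_d) + 0.5), v)) for v in values
    ]
    return struct.pack("<e32b", d, *quants)


def dequantize_q8_0_block(block):
    """Recover approximate floats: x is roughly scale * quant."""
    d, *quants = struct.unpack("<e32b", block)
    return [d * q for q in quants]
```

At 34 bytes per 32 weights the format costs 8.5 bits per weight, which is why a Q8_0 file is slightly larger than a quarter of its F32 counterpart.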

### 2. Hugging Face Native Format (`./hf/`)
Standard configuration and weight files used by the Hugging Face `transformers` library (PyTorch):
* **`hf/model.safetensors`**: Unquantized native model parameters using explicit MQA structures.
* **`hf/config.json`**: Architectural settings specifying the asymmetrical head layout (`num_attention_heads: 4`, `num_key_value_heads: 1`).
* **`hf/generation_config.json`**: Default generation settings.
* **`hf/tokenizer_config.json`**: Tokenizer behavior configuration enabling automatic `<s>` (BOS) injection and sequence padding boundaries.
* **`hf/special_tokens_map.json`**: Maps string keys directly to internal special token IDs.
* **`hf/tokenizer.model`**: The master 512-vocab SentencePiece tokenizer binary file.
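
Because the tokenizer enables `byte_fallback`, a custom engine's detokenizer has to merge byte pieces back into UTF-8 text. The sketch below shows one way to do that, assuming the usual SentencePiece conventions (byte pieces spelled `<0xNN>` and `U+2581` as the word-boundary marker). The helper name `decode_pieces` is hypothetical, and unlike the real SentencePiece decoder this sketch does not strip the leading space.

```python
import re

# SentencePiece byte-fallback pieces look like "<0xE7>"
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def decode_pieces(pieces):
    """Join token pieces into text, merging byte-fallback pieces into UTF-8."""
    buf = bytearray()
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            # Raw byte: append it and let the final decode assemble characters
            buf.append(int(match.group(1), 16))
        else:
            # Ordinary piece: U+2581 marks a word boundary (a space)
            buf.extend(piece.replace("\u2581", " ").encode("utf-8"))
    return buf.decode("utf-8", errors="replace")
```

Decoding must happen on the accumulated byte buffer rather than piece by piece, since a multi-byte character is split across several consecutive byte tokens.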

---

## 🚀 Usage Examples

### A. Running GGUF via llama.cpp
To verify your local hardware runtime execution or evaluate token generation logic under MQA parameters:
```bash
./llama-cli -m tinymqa1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
```

### B. Loading Hugging Face Formats via Python

With the tokenizer metadata (`tokenizer_config.json` / `special_tokens_map.json`) fully populated, you can load the tokenizer and model directly with the standard Hugging Face `Auto*` classes; no custom wrapper code is needed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tinymqa1m"

print("Loading tokenizer and MQA model configuration...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Tom and Jerry are "
# Formatting and <s> (BOS) insertion are handled automatically via configuration metadata
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Executing text generation loop (Validating MQA projection tensors)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=64,  # total length: prompt tokens are counted too
        do_sample=False,  # greedy decoding for reproducible output
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)

```
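
When checking a custom engine against the reference output above, it helps to compare token IDs rather than decoded text. The helper below is a small sketch for that purpose; the name `first_divergence` and the idea of feeding it ID lists from two runtimes are assumptions, not part of this repository.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if both match."""
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    # One sequence is a strict prefix of the other
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None
```

With greedy decoding on both sides, an early divergence on the F32 file usually points at an engine bug, whereas quantized variants can legitimately drift after some tokens because of rounding.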

---

## πŸ“ Model Specifications

All 4 query heads share a single key-value head, so every attention layer exercises the KV broadcast path of the engine under test.

* **Architecture:** Llama 2 with **Multi-Query Attention (MQA)**
* **Dataset:** TinyStories
* **Total Parameters:** ~1M (Exactly 896,256 parameters)
* **Vocabulary Size:** 512 (Custom SentencePiece BPE with `byte_fallback` enabled)
* **Hidden Size (`hidden_size`):** 128
* **Number of Hidden Layers (`num_hidden_layers`):** 4
* **Number of Attention Heads (`num_attention_heads`):** 4  *(head_dim = 32)*
* **Number of Key-Value Heads (`num_key_value_heads`):** 1 *(Strict MQA broadcast ratio)*
* **Intermediate Size (`intermediate_size`):** 352
* **Max Position Embeddings (`max_position_embeddings`):** 256
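
The attention shapes and KV-cache footprint implied by these values can be worked out as follows. This is a sketch under stated assumptions: the helper names are hypothetical, shapes follow the `(out_features, in_features)` convention of PyTorch linear layers, and the cache is assumed to hold F32 keys and values for every layer and position.

```python
def attention_shapes(hidden_size, num_attention_heads, num_key_value_heads):
    """Projection weight shapes as (out_features, in_features)."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = num_key_value_heads * head_dim
    return {
        "q_proj": (num_attention_heads * head_dim, hidden_size),
        "k_proj": (kv_dim, hidden_size),  # narrow under MQA
        "v_proj": (kv_dim, hidden_size),  # narrow under MQA
        "o_proj": (hidden_size, num_attention_heads * head_dim),
    }


def kv_cache_bytes(num_hidden_layers, num_key_value_heads, head_dim,
                   max_positions, bytes_per_element=4):
    """Bytes needed to cache keys and values for a full context."""
    return (2 * num_hidden_layers * num_key_value_heads * head_dim
            * max_positions * bytes_per_element)


shapes = attention_shapes(hidden_size=128, num_attention_heads=4,
                          num_key_value_heads=1)
mqa_cache = kv_cache_bytes(4, 1, 32, 256)  # this model
mha_cache = kv_cache_bytes(4, 4, 32, 256)  # same model with 4 KV heads
```

For this configuration `k_proj` and `v_proj` come out as `(32, 128)`, and the full-context F32 cache is 262,144 bytes (256 KiB), a quarter of what the same model would need with 4 KV heads.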

## 📜 Acknowledgments & License

* **Original Implementation:** Inspired by Andrej Karpathy's `llama2.c` project.
* **Dataset:** TinyStories dataset.
* **License:** **MIT License**. You are free to use, modify, and distribute these assets for any purpose.