---
license: mit
tags:
- qwen2
- gguf
- safetensors
- transformers
- tinyqwen
- validation
- test-suite
- scratch-trained
---

# TinyStories Qwen2 2M (tinyqwen2m) GGUF & HF Validation Suite

This repository provides an ultra-lightweight Qwen2 model in both **GGUF** and **Hugging Face / Safetensors** formats. The model was trained from scratch on the TinyStories dataset until convergence and is intended for testing and validating inference engines.

### Why this repository exists
When developing a custom LLM inference engine, debugging against a full-sized model is slow. This suite offers a genuine **2M-parameter Qwen2 model** (~4.0 MB), so developers can validate their loaders, namespace parsing, compact token embedding matrices, and Grouped-Query Attention (GQA) logic step by step, with fast iteration and verifiable natural-language output.

### Key Validation Targets
This model is designed to expose architectural layout bugs that standard Llama files cannot trigger:
* **Dynamic Namespace Prefix Parsing:** GGUF metadata keys use the `qwen2.` namespace (e.g., `qwen2.attention.head_count`) instead of the traditional `llama.` identifier. This forces your GGUF loader to build metadata key names dynamically from `general.architecture` rather than falling back to hardcoded defaults.
* **True 4:1 GQA Ratio:** Uses an asymmetric configuration with exactly 4 query heads and 1 key-value head. This checks that KV cache structures, stride calculations, and parallel splits handle Grouped-Query Attention correctly when the query and KV head counts differ.
* **Compact Token Arrays & Tied Embeddings:** Uses a small, clean vocabulary of `1024` tokens to reduce the risk of out-of-bounds index-select failures (such as `indexSelectSmallIndex` errors) on custom hardware setups. Configured with `"tie_word_embeddings": true` to validate that the input embedding and the output projection share a single weight matrix.
* **Layer-wise Projection Bias Verification (Deep & Slim Architecture):** Features an expanded 8-layer depth combined with an explicit, non-zero constant bias (`0.1`) injected into the `q_proj`, `k_proj`, and `v_proj` surfaces during training. If an inference engine fails to process or omits these projection biases, the numerical discrepancy accumulates rapidly across the 8 sequential layers, causing text generation to break completely into random garbage within a few tokens.
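
The first two targets can be sketched in a few lines of Python. This is a minimal illustration, not the loader of any particular engine: the `metadata` dict is a hypothetical stand-in for whatever your GGUF parser returns, the helper names `arch_key` and `kv_head_for` are made up for this example, and the key suffixes other than `attention.head_count` are assumed to follow the usual GGUF naming.

```python
# Hypothetical metadata, standing in for the key/value pairs a GGUF parser returns.
# The values mirror the model specifications listed in this README.
metadata = {
    "general.architecture": "qwen2",
    "qwen2.attention.head_count": 4,
    "qwen2.attention.head_count_kv": 1,  # assumed key name (usual GGUF convention)
    "qwen2.block_count": 8,              # assumed key name (usual GGUF convention)
}


def arch_key(meta, suffix):
    """Look up an architecture-scoped key, using general.architecture as the prefix."""
    arch = meta["general.architecture"]
    key = f"{arch}.{suffix}"
    if key not in meta:
        # Fail loudly instead of silently falling back to a "llama." default.
        raise KeyError(f"missing GGUF metadata key: {key}")
    return meta[key]


def kv_head_for(q_head, n_heads, n_kv_heads):
    """Map a query head index to the KV head it shares under GQA."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads
    return q_head // group_size


n_heads = arch_key(metadata, "attention.head_count")
n_kv_heads = arch_key(metadata, "attention.head_count_kv")
mapping = [kv_head_for(h, n_heads, n_kv_heads) for h in range(n_heads)]
print(mapping)  # [0, 0, 0, 0]: all four query heads read the single KV head
```

A loader that hardcodes the `llama.` prefix would raise `KeyError` here, which is the failure this model is meant to surface early.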

---

## 📂 Repository Structure & File Descriptions

```text
.
├── tinyqwen2m.gguf
├── README.md
└── hf/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    └── tokenizer.json
```

### 1. GGUF Format (Root Directory)

A validation binary converted for custom engines and native runtimes. The tokenizer vocabulary and special tokens are fully embedded within the GGUF file.

* **`tinyqwen2m.gguf`** (~4.0 MB): Validates dynamic `qwen2.` GGUF namespace parsing, attention bias handling, RoPE operations, 16-bit floating point matrix layouts, type casting, and SwiGLU activation pipelines.
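
Before parsing any metadata, a custom loader usually checks the fixed-size GGUF header. The sketch below assumes a GGUF v2 or v3 file (little-endian, 64-bit counts); the function name `parse_gguf_header` is hypothetical, and the demo runs on a synthetic header rather than on the real file, so the counts shown are placeholders.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header (assumes GGUF v2/v3, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"bad magic: {data[:4]!r}")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration only; the counts are made-up placeholders.
fake_header = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 99, 20)
print(parse_gguf_header(fake_header))

# With the real file, read the first 24 bytes instead:
# with open("tinyqwen2m.gguf", "rb") as f:
#     print(parse_gguf_header(f.read(HEADER_SIZE)))
```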

### 2. Hugging Face Native Format (`./hf/`)

This directory contains the standard files required to load the model with the Hugging Face `transformers` library (PyTorch backend):

* **`hf/model.safetensors`**: The raw, unquantized model weights stored securely in Safetensors format.
* **`hf/config.json`**: The architectural configuration file defining hyperparameters (8 layers, attention biases, weight-tying, standard dimensions).
* **`hf/generation_config.json`**: Default parameters for text generation.
* **`hf/tokenizer_config.json`**: Tokenizer settings for the custom ChatML/Qwen2 fast tokenizer.
* **`hf/special_tokens_map.json`**: Maps the special token roles to their token strings.
* **`hf/tokenizer.json`**: The definition of the custom Byte-Level BPE tokenizer.

---

## 🚀 Usage Examples

### A. Running GGUF via Native CLI

To verify your local loader setup or validate dynamic key parsing via native completions:

```bash
./llama-completion -m tinyqwen2m.gguf -p "Once upon" -n 100 --temp 0.0 --repeat-penalty 1.0 --top-p 1.0
```

**Expected Golden Output:**

> Once upon a time, there was a little girl named Lily.
> Lily loved to play with her toys and her friends. One day, Lily's friend came over to play. She showed her how to make a tall tower.
> Lily was so happy and proud of her tall tower. She showed it to her friend and they both laughed together.
> From that day on, Lily and her friend played together every day. They would pretend they

### B. Loading Hugging Face Formats via Python

To get the same tokenization and generation results as the GGUF file, load the tokenizer from the `hf` subfolder with `PreTrainedTokenizerFast`, and manually prepend the BOS token ID (`1000`) to replicate the exact dataset layout used during training.

```python
import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM

repo_id = "shibatch/tinyqwen2m"

# Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely
tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Once upon"

# Tokenize without injecting automatic special tokens
input_ids = tokenizer.encode(prompt, add_special_tokens=False)

# Manually prepend the exact BOS token ID (1000) to match the training pipeline
input_ids = [tokenizer.bos_token_id] + input_ids
inputs = {"input_ids": torch.tensor([input_ids])}

with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100, 
        do_sample=False,        # Matches --temp 0
        repetition_penalty=1.0,
        top_p=1.0,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📝 Model Specifications

The network uses tied input and output embeddings (`tie_word_embeddings`), power-of-two tensor shapes, and explicit attention QKV bias vectors, matching the layout of full-scale Qwen2 models.

* **Architecture:** Qwen2 (`Qwen2ForCausalLM`)
* **Dataset:** TinyStories
* **Total Parameters:** ~2.03M
* **Vocabulary Size:** 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special tokens)
* **Hidden Size (`hidden_size`):** 128
* **Head Dimension (`head_dim`):** 32 (128 / 4, satisfies hardware SDPA and RoPE alignment constraints)
* **Number of Hidden Layers (`num_hidden_layers`):** 8 (Deep vertical structure so that errors from omitted biases compound quickly)
* **Number of Attention Heads (`num_attention_heads`):** 4
* **Number of Key-Value Heads (`num_key_value_heads`):** 1 (Standard GQA 4:1 topology)
* **Intermediate Size (`intermediate_size`):** 512 (Standard power-of-two dimension)
* **Max Position Embeddings (`max_position_embeddings`):** 256 (Standard power-of-two context length)
* **Attention Bias (`attention_bias`):** True (Explicitly fixed at 0.1 for q_proj, k_proj, and v_proj)
* **RMS Norm Epsilon:** 1e-06
* **RoPE Base Frequency (`rope_theta`):** 1,000,000.0
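
The parameter total can be cross-checked from the dimensions listed above. The sketch below assumes the standard Qwen2 layout: biases only on `q_proj`, `k_proj`, and `v_proj`, no bias on `o_proj` or the MLP, two RMSNorm weight vectors per layer plus a final norm, and a tied output projection. The function name `qwen2_param_count` is hypothetical.

```python
def qwen2_param_count(vocab, hidden, layers, n_heads, n_kv_heads,
                      intermediate, tied=True):
    """Count parameters for a Qwen2-style decoder (assumed standard layout)."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim

    embed = vocab * hidden
    q_proj = hidden * hidden + hidden          # weight + bias
    kv_proj = 2 * (hidden * kv_dim + kv_dim)   # k_proj and v_proj, each weight + bias
    o_proj = hidden * hidden                   # no bias
    mlp = 3 * hidden * intermediate            # gate, up, and down projections
    norms = 2 * hidden                         # input and post-attention RMSNorm

    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    total = embed + layers * per_layer + hidden  # + final RMSNorm
    if not tied:
        total += vocab * hidden                  # separate lm_head
    return total


total = qwen2_param_count(vocab=1024, hidden=128, layers=8, n_heads=4,
                          n_kv_heads=1, intermediate=512)
print(total)      # 2035328
print(total * 2)  # 4070656 bytes at 16 bits per weight
```

Under these assumptions the count lands close to the figure listed above, and at two bytes per weight it is consistent with the ~4.0 MB file size.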

## 📜 Acknowledgments & License

* **Original Architecture:** Qwen2 Model Family.
* **Dataset:** TinyStories dataset.
* **License:** **MIT License**. You are free to use, modify, and distribute these assets for any purpose.