--- license: mit tags: - qwen2 - gguf - safetensors - transformers - tinyqwen - validation - test-suite - scratch-trained --- # TinyStories Qwen2 2M (tinyqwen2m) GGUF & HF Validation Suite This repository provides ultra-lightweight Qwen2 model files across both **GGUF** and **Hugging Face / Safetensors** formats, trained to 100% convergence on the TinyStories dataset and optimized for inference engine testing and validation. ### Why this repository exists When developing a custom LLM inference engine, debugging with a full-sized model is slow. This suite offers a true **2M parameter scale Qwen2 model** (~4.0MB), allowing developers to validate their loaders, namespace parsing, compact tokenization matrices, and Grouped-Query Attention (GQA) logic step-by-step with maximum efficiency and verifiable natural language outputs. ### Key Validation Targets This model is designed to expose architectural layout bugs that standard Llama files cannot trigger: * **Dynamic Namespace Prefix Parsing:** GGUF metadata keys use the `qwen2.` namespace (e.g., `qwen2.attention.head_count`) instead of the traditional `llama.` identifier. This forces your GGUF loader to resolve string lookup configurations dynamically based on `general.architecture` rather than falling back onto hardcoded defaults. * **True 4:1 GQA Ratio:** Implements an asymmetric configuration containing exactly 4 Query heads and 1 Key-Value head. This checks that KV caching structures, stride calculations, and sequence parallel splits handle Grouped-Query Attention topologies properly without scaling alignment failures. * **Compact Token Arrays & Tied Embeddings:** Utilizes a highly optimized, clean vocabulary size of `1024` to eliminate index select out-of-bounds risks (`indexSelectSmallIndex` errors) on private hardware setups. Configured with `"tie_word_embeddings": true` to validate shared memory layouts across projection surfaces. * **Layer-wise Projection Bias Verification (Deep & Slim Architecture):** Features an expanded 8-layer depth combined with an explicit, non-zero constant bias (`0.1`) injected into the `q_proj`, `k_proj`, and `v_proj` surfaces during training. If an inference engine fails to process or omits these projection biases, the numerical discrepancy accumulates rapidly across the 8 sequential layers, causing text generation to break completely into random garbage within a few tokens. --- ## 📂 Repository Structure & File Descriptions ```text . ├── tinyqwen2m.gguf ├── README.md └── hf/ ├── config.json ├── generation_config.json ├── model.safetensors ├── tokenizer_config.json ├── special_tokens_map.json └── tokenizer.json ``` ### 1. GGUF Format (Root Directory) A validation binary converted for custom engines and native runtimes. The tokenizer vocabulary and special tokens are fully embedded within the GGUF file. * **`tinyqwen2m.gguf`** (~4.0 MB) Validates dynamic `qwen2.` GGUF namespace parsing, attention bias handling, RoPE operations, 16-bit floating point matrix layouts, type casting, and SwiGLU activation pipelines. ### 2. Hugging Face Native Format (`./hf/`) This directory contains the standard files required to load the model using the PyTorch `transformers` library: * **`hf/model.safetensors`**: The raw, unquantized model weights stored securely in Safetensors format. * **`hf/config.json`**: The architectural configuration file defining hyperparameters (8 layers, attention biases, weight-tying, standard dimensions). * **`hf/generation_config.json`**: Default parameters optimized for text generation. * **`hf/tokenizer_config.json`**: Tokenizer behavior layout specifying the custom ChatML/Qwen2 fast tokenizer setup. * **`hf/special_tokens_map.json`**: Architectural mappings tying special characters to the token blocks. * **`hf/tokenizer.json`**: The custom Byte-Level BPE tokenization descriptor layout. --- ## 🚀 Usage Examples ### A. Running GGUF via Native CLI To verify your local loader setup or validate dynamic key parsing via native completions: ```bash ./llama-completion -m tinyqwen2m.gguf -p "Once upon" -n 100 --temp 0.0 --repeat-penalty 1.0 --top-p 1.0 ``` **Expected Golden Output:** > Once upon a time, there was a little girl named Lily. > Lily loved to play with her toys and her friends. One day, Lily's friend came over to play. She showed her how to make a tall tower. > Lily was so happy and proud of her tall tower. She showed it to her friend and they both laughed together. > From that day on, Lily and her friend played together every day. They would pretend they ### B. Loading Hugging Face Formats via Python To get identical token alignment and generation results as GGUF, use `PreTrainedTokenizerFast` to load the subfolder configurations, and manually prepend the BOS token ID (`1000`) to replicate the exact dataset layout used during training. ```python import torch from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM repo_id = "shibatch/tinyqwen2m" # Load via PreTrainedTokenizerFast to preserve the vocabulary configuration safely tokenizer = PreTrainedTokenizerFast.from_pretrained(repo_id, subfolder="hf") model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf") prompt = "Once upon" # Tokenize without injecting automatic special tokens input_ids = tokenizer.encode(prompt, add_special_tokens=False) # Manually prepend the exact BOS token ID (1000) to match the training pipeline input_ids = [tokenizer.bos_token_id] + input_ids inputs = {"input_ids": torch.tensor([input_ids])} with torch.no_grad(): outputs = model.generate( **inputs, max_new_tokens=100, do_sample=False, # Matches --temp 0 repetition_penalty=1.0, top_p=1.0, bos_token_id=tokenizer.bos_token_id, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id ) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` --- ## 📝 Model Specifications The network architecture features an active weight-tying matrix (`tie_word_embeddings`), perfectly aligned power-of-two shapes, and explicit Attention QKV bias vectors matching full-scale Qwen2 profiles. * **Architecture:** Qwen2 (`Qwen2ForCausalLM`) * **Dataset:** TinyStories * **Total Parameters:** ~2.03M * **Vocabulary Size:** 1,024 (Custom Byte-Level BPE Tokenizer with 1000 base tokens + special characters) * **Hidden Size (`hidden_size`):** 128 * **Head Dimension (`head_dim`):** 32 (128 / 4, satisfies hardware SDPA and RoPE alignment constraints) * **Number of Hidden Layers (`num_hidden_layers`):** 8 (Deep vertical structure to accelerate bias omission errors) * **Number of Attention Heads (`num_attention_heads`):** 4 * **Number of Key-Value Heads (`num_key_value_heads`):** 1 (Standard GQA 4:1 topology) * **Intermediate Size (`intermediate_size`):** 512 (Standard power-of-two dimension) * **Max Position Embeddings (`max_position_embeddings`):** 256 (Standard power-of-two context length) * **Attention Bias (`attention_bias`):** True (Explicitly fixed at 0.1 for q_proj, k_proj, and v_proj) * **RMS Norm Epsilon:** 1e-06 * **RoPE Base Frequency (`rope_theta`):** 1,000,000.0 ## 📜 Acknowledgments & License * **Original Architecture:** Qwen2 Model Family. * **Dataset:** TinyStories dataset. * **License:** **MIT License**. You are free to use, modify, and distribute these assets for any purpose.