---
license: mit
base_model: karpathy/tinyllamas
tags:
- llama2
- mqa
- gguf
- safetensors
- transformers
- tinyllamas
- validation
- test-suite
---
# TinyStories Llama2 1M MQA (tinymqa1m) GGUF & HF Validation Suite
This repository provides an ultra-lightweight Llama 2 model variant that combines a **custom BPE tokenizer** with a strict **MQA (Multi-Query Attention)** layout. It is trained on the TinyStories dataset and intended for compiler, runtime, and hardware kernel validation.
---
## Comparison: `tinymqa1m` vs Previous Variants
To help you choose the right test asset for your engine debugging goals, the table below summarizes the architectural differences across the 1M-parameter suite:
| Feature / Metric | `tiny1m` (Standard) | `tinybpe1m` (BPE Variant) | `tinymqa1m` (This Repository) |
| :--- | :--- | :--- | :--- |
| **Attention Mechanism** | **MHA** (Multi-Head Attention) | **MHA** (Multi-Head Attention) | **MQA** (Multi-Query Attention) |
| **Attention Heads ($N_{heads} / N_{kv\_heads}$)** | 2 Heads / 2 KV Heads | 2 Heads / 2 KV Heads | **4 Heads / 1 KV Head** (Asymmetric) |
| **Tokenizer Type** | Simple Character-level | **SentencePiece BPE** | **SentencePiece BPE** |
| **Byte Fallback Support** | No | **Yes** (`byte_fallback=True`) | **Yes** (`byte_fallback=True`) |
| **`llama2.c` Compatibility** | **Fully Compatible** (`run.c`) | Incompatible (Corrupts text) | **Incompatible** (Crashes/Corrupts) |
| **Primary Debug Target** | Core matrix multiplication & layout | `byte_fallback` decoder loop | **KV-cache alignment & broadcast** |
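
The `byte_fallback` decoder loop named in the table can be sketched as follows. This is a minimal illustration, not the tokenizer's actual implementation: it assumes SentencePiece-style byte pieces written as `<0xNN>` and the `▁` (U+2581) whitespace marker, and `decode_pieces` is a hypothetical helper name.

```python
import re

# Byte-fallback pieces are assumed to look like "<0xE3>".
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def decode_pieces(pieces):
    """Join token pieces into text, merging runs of byte pieces into UTF-8."""
    out = []
    pending = bytearray()  # raw bytes collected from consecutive byte pieces

    def flush():
        # Decode the collected bytes only once the run has ended, because a
        # single character may span several byte pieces.
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(piece.replace("\u2581", " "))
    flush()
    return "".join(out)


# The three byte pieces together encode one 3-byte UTF-8 character.
print(decode_pieces(["\u2581Hi", "<0xE3>", "<0x81>", "<0x82>", "!"]))
```

Decoding each byte piece on its own is the usual way this loop goes wrong, since a partial UTF-8 sequence is not valid text.
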
### Why test with `tinymqa1m`?
Modern architectures like Llama 3, Gemma, and Mistral rely on GQA (Grouped-Query Attention) or MQA to reduce memory bandwidth. Implementing these attention patterns in custom inference engines (C/C++, Vulkan, etc.) frequently introduces boundary bugs in KV-cache tensor indexing. This model lets you validate **KV-cache broadcasting logic** thoroughly within a 1M-parameter profile, so memory use stays negligible.
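
As an illustration of the indexing that tends to go wrong, the sketch below maps a query head to the KV head it reads from and computes an offset into a flat cache. The cache layout `[position][kv_head][head_dim]` and both function names are assumptions made for this example; real engines may lay out the cache differently. Under this layout, reusing the MHA stride `n_heads * head_dim` (128 here) instead of `n_kv_heads * head_dim` (32) would read past the intended row.

```python
def kv_head_index(q_head, n_heads, n_kv_heads):
    """Return the KV head that query head `q_head` reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads sharing one KV head
    return q_head // group_size


def kv_cache_offset(pos, q_head, n_heads, n_kv_heads, head_dim):
    """Offset of the K (or V) vector used by `q_head` at position `pos`.

    Assumes a flat cache laid out as [position][kv_head][head_dim].
    """
    kv_dim = n_kv_heads * head_dim  # row stride in elements
    return pos * kv_dim + kv_head_index(q_head, n_heads, n_kv_heads) * head_dim


# tinymqa1m: 4 query heads, 1 KV head, head_dim = 32
offsets = [kv_cache_offset(3, h, 4, 1, 32) for h in range(4)]
print(offsets)  # [96, 96, 96, 96]: every query head reads the same vector
```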
---
## Repository Structure & File Descriptions
### 1. GGUF Formats (Root Directory `./`)
A complete suite built for `llama.cpp` and compatible custom runtimes. The MQA hyperparameters and the tokenizer layout are embedded in each GGUF file:
| Filename(s) / Wildcard Pattern | Type | Size | Purpose / Validation Target |
| :--- | :--- | :--- | :--- |
| **`tinymqa1m.F32.gguf`** | `F32` | ~4.0 MB | **Baseline Test.** Validates GGUF parsing, MQA tensor layout, matrix dimensions, and RoPE indexing without dequantization factors. |
| **`tinymqa1m.F16.gguf`**<br>**`tinymqa1m.BF16.gguf`** | `F16`<br>`BF16` | ~2.0 MB | **Half-Precision Test.** Validates 16-bit float loading, tensor broadcasting, and structural inference stability. |
| **`tinymqa1m.Q8_0.gguf`** | `Q8_0` | ~1.1 MB | **Quantization Level 1.** Validates block-based uniform scaling with 32 elements under MQA dimensions. |
| **`tinymqa1m.Q4_0.gguf`**<br>**`tinymqa1m.Q4_1.gguf`** | `Q4_0`<br>`Q4_1` | ~0.7 MB | **Quantization Level 2.** Validates classic 4-bit linear quantization and bit-unpacking logic. |
| **`tinymqa1m.Q2_K.gguf`** | `Q2_K` | ~0.5 MB | **Standard K-Quant (2-bit).** Validates 2-bit super-block quantization parsing. |
| **`tinymqa1m.Q3_K_*.gguf`**<br>↳ *`tinymqa1m.Q3_K_S.gguf`*<br>↳ *`tinymqa1m.Q3_K_M.gguf`*<br>↳ *`tinymqa1m.Q3_K_L.gguf`* | `Q3_K` | ~0.6 MB | **Standard K-Quant (3-bit).** Validates Small, Medium, and Large sub-variants of 3-bit multi-block structures. |
| **`tinymqa1m.Q4_K_*.gguf`**<br>↳ *`tinymqa1m.Q4_K_S.gguf`*<br>↳ *`tinymqa1m.Q4_K_M.gguf`* | `Q4_K` | ~0.7 MB | **Standard K-Quant (4-bit).** Validates Small and Medium sub-variants of modern 4-bit super-block structural parsing. |
| **`tinymqa1m.Q5_K_*.gguf`**<br>↳ *`tinymqa1m.Q5_K_S.gguf`*<br>↳ *`tinymqa1m.Q5_K_M.gguf`* | `Q5_K` | ~0.8 MB | **Standard K-Quant (5-bit).** Validates Small and Medium sub-variants of 5-bit mixed precision super-blocks. |
| **`tinymqa1m.Q6_K.gguf`** | `Q6_K` | ~0.9 MB | **Standard K-Quant (6-bit).** Validates 6-bit high-fidelity super-block quantization. |
| **`tinymqa1m.IQ3_*.gguf`**<br>↳ *`tinymqa1m.IQ3_XXS.gguf`*<br>↳ *`tinymqa1m.IQ3_S.gguf`* | `I-Quants` | ~0.5 MB | **Importance Quants (3-bit).** Non-linear 3-bit importance quantization targeting lookup table (codebook) decoding logic. |
| **`tinymqa1m.IQ4_*.gguf`**<br>↳ *`tinymqa1m.IQ4_NL.gguf`*<br>↳ *`tinymqa1m.IQ4_XS.gguf`* | `I-Quants` | ~0.6 MB | **Importance Quants (4-bit).** Non-linear 4-bit importance quantization variants (Non-Linear and Extra Small). |
| **`tinymqa1m.TQ1_0.gguf`**<br>**`tinymqa1m.TQ2_0.gguf`** | `Ternary` | ~0.4 MB | **Experimental.** Ternary (-1, 0, 1) state quantization for cutting-edge engine testing. |
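
To make the `Q8_0` row concrete, here is a rough sketch of block-based uniform scaling over 32 elements: one half-precision scale followed by 32 signed 8-bit values, 34 bytes per block. It is a simplified illustration rather than the `llama.cpp` code; the function names are hypothetical, and Python's `round` breaks ties differently from C's `roundf`.

```python
import struct

QK8_0 = 32                    # elements per block
BLOCK_FORMAT = "<e32b"        # little-endian fp16 scale + 32 int8 quants


def quantize_q8_0_block(values):
    """Pack 32 floats into one 34-byte block (scale, then quantized values)."""
    if len(values) != QK8_0:
        raise ValueError("a Q8_0 block holds exactly 32 values")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    inv = 1.0 / scale if scale else 0.0
    quants = [max(-127, min(127, round(v * inv))) for v in values]
    return struct.pack(BLOCK_FORMAT, scale, *quants)


def dequantize_q8_0_block(block):
    """Unpack one block back into 32 approximate floats."""
    scale, *quants = struct.unpack(BLOCK_FORMAT, block)
    return [scale * q for q in quants]


weights = [i / 100 - 0.16 for i in range(QK8_0)]
block = quantize_q8_0_block(weights)
restored = dequantize_q8_0_block(block)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(block), max_error)
```

Because every element in a block shares one scale, a dequantizer that misreads the block stride or the scale's byte order produces errors across the whole block, which is easy to spot on a model this small.
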
### 2. Hugging Face Native Format (`./hf/`)
Configuration files and weights for the Hugging Face `transformers` library (PyTorch):
* **`hf/model.safetensors`**: Unquantized native model parameters using explicit MQA structures.
* **`hf/config.json`**: Architectural settings specifying the asymmetrical head layout (`num_attention_heads: 4`, `num_key_value_heads: 1`).
* **`hf/generation_config.json`**: Default generation settings.
* **`hf/tokenizer_config.json`**: Tokenizer behavior configuration that enables automatic `<s>` (BOS) injection and defines sequence padding.
* **`hf/special_tokens_map.json`**: Maps special-token names (string keys) to the tokenizer's special tokens.
* **`hf/tokenizer.model`**: The master 512-vocab SentencePiece tokenizer binary file.
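
The asymmetric head layout in `config.json` determines the attention weight shapes a loader should expect. The sketch below derives them under the usual PyTorch `(out_features, in_features)` convention; `attention_proj_shapes` is a hypothetical helper, and the dictionary only mirrors the values stated in this README rather than reading the real file.

```python
def attention_proj_shapes(config):
    """Expected weight shapes of the four attention projections in one layer."""
    hidden = config["hidden_size"]
    n_heads = config["num_attention_heads"]
    n_kv_heads = config["num_key_value_heads"]
    head_dim = hidden // n_heads
    return {
        "q_proj": (n_heads * head_dim, hidden),
        "k_proj": (n_kv_heads * head_dim, hidden),  # narrower under MQA
        "v_proj": (n_kv_heads * head_dim, hidden),  # narrower under MQA
        "o_proj": (hidden, n_heads * head_dim),
    }


# Values taken from this README, not loaded from hf/config.json.
config = {
    "hidden_size": 128,
    "num_attention_heads": 4,
    "num_key_value_heads": 1,
}
for name, shape in attention_proj_shapes(config).items():
    print(name, shape)
```

A loader that assumes square K and V projections would expect `(128, 128)` and fail on the `(32, 128)` tensors this layout implies.
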
---
## Usage Examples
### A. Running GGUF via llama.cpp
To check that your local runtime executes the model correctly, or to inspect token generation under MQA:
```bash
./llama-cli -m tinymqa1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
```
### B. Loading Hugging Face Formats via Python
Because the runtime metadata (`tokenizer_config.json` / `special_tokens_map.json`) is fully populated, you can load the tokenizer and model directly with the standard `transformers` classes; no custom wrapper code is needed.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
repo_id = "shibatch/tinymqa1m"
print("Loading tokenizer and MQA model configuration...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
prompt = "Tom and Jerry are "
# Formatting and <s> (BOS) insertion are handled automatically via configuration metadata
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print("Executing text generation loop (Validating MQA projection tensors)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=False
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n--- Inference Test Result ---")
print("Prompt :", prompt)
print("Generated:", generated_text)
```
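
Both examples above decode greedily (`--temp 0.0` and `do_sample=False`), so one way to use them for validation is to compare token ID sequences from two runtimes and find where they first disagree. The helper below is a hypothetical sketch, and the sample ID lists are made up for illustration; how you extract token IDs from each runtime is left open. Quantized files can legitimately diverge from the unquantized reference, so a mismatch is a starting point for investigation rather than proof of a bug.

```python
def first_divergence(reference, candidate):
    """Return the index of the first differing token, or None if identical."""
    for index, (ref_id, cand_id) in enumerate(zip(reference, candidate)):
        if ref_id != cand_id:
            return index
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


# Made-up token IDs, only to show the call.
reference_ids = [1, 87, 140, 22, 305]
candidate_ids = [1, 87, 140, 19, 305]
print(first_divergence(reference_ids, candidate_ids))  # 3
```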
---
## Model Specifications
Each attention layer maps 4 query heads onto a single shared key-value head, which directly exercises a runtime's K/V broadcasting logic.
* **Architecture:** Llama 2 with **Multi-Query Attention (MQA)**
* **Dataset:** TinyStories
* **Total Parameters:** ~1M (Exactly 896,256 parameters)
* **Vocabulary Size:** 512 (Custom SentencePiece BPE with `byte_fallback` enabled)
* **Hidden Size (`hidden_size`):** 128
* **Number of Hidden Layers (`num_hidden_layers`):** 4
* **Number of Attention Heads (`num_attention_heads`):** 4 *(head_dim = 32)*
* **Number of Key-Value Heads (`num_key_value_heads`):** 1 *(Strict MQA broadcast ratio)*
* **Intermediate Size (`intermediate_size`):** 352
* **Max Position Embeddings (`max_position_embeddings`):** 256
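
The figures above fix the KV-cache footprint, which a short calculation makes explicit. This sketch counts cache elements for the MQA layout and for a hypothetical MHA layout with the same dimensions; the 2-byte element size assumes a half-precision cache, which is an assumption rather than something this README specifies.

```python
def kv_cache_elements(n_layers, n_kv_heads, head_dim, n_positions):
    """Elements held by the K and V caches together at full context length."""
    return 2 * n_layers * n_kv_heads * head_dim * n_positions


hidden_size = 128
n_heads = 4
head_dim = hidden_size // n_heads  # 32

mqa = kv_cache_elements(n_layers=4, n_kv_heads=1, head_dim=head_dim, n_positions=256)
mha = kv_cache_elements(n_layers=4, n_kv_heads=4, head_dim=head_dim, n_positions=256)

bytes_per_element = 2  # assumed fp16 cache
print(mqa, mqa * bytes_per_element)  # 65536 elements, 131072 bytes
print(mha, mha * bytes_per_element)  # 262144 elements, 524288 bytes
print(mha // mqa)                    # 4
```

The 4x ratio equals the number of query heads per KV head, which is also the broadcast factor an engine has to apply when reading the cache.
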
## Acknowledgments & License
* **Original Implementation:** Inspired by Andrej Karpathy's `llama2.c` project.
* **Dataset:** TinyStories dataset.
* **License:** **MIT License**. You are free to use, modify, and distribute these assets for any purpose.