TinyStories Gemma 2 1M GQA (tinygemma1m) GGUF & HF Validation Suite

This repository provides an ultra-lightweight Gemma 2 model variant that combines a custom BPE tokenizer with a strict GQA (Grouped-Query Attention) layout. It is trained on the TinyStories dataset and scaled down to roughly 1M parameters to serve as a focused validation testbed.

It is intended specifically for debugging custom inference engines and runtime tensor compilers against Gemma 2's more complex operators.

📊 Comparison: `tinygemma1m` vs Other 1M Variants

To track which runtime features are covered across the 1M parameter test suites, the architectural layout of each variant is summarized below:

| Feature / Metric | `tiny1m` (Standard) | `tinybpe1m` (BPE Variant) | `tinymqa1m` (MQA Variant) | `tinygemma1m` (This Repository) |
|---|---|---|---|---|
| Base Architecture | Llama 2 | Llama 2 | Llama 2 | Gemma 2 |
| Attention Mechanism | MHA (Multi-Head) | MHA (Multi-Head) | MQA (Multi-Query) | GQA (Grouped-Query) |
| Attention Heads ($N_{\text{heads}} / N_{\text{kv\_heads}}$) | 2 Heads / 2 KV | 2 Heads / 2 KV | 4 Heads / 1 KV | 2 Heads / 1 KV Head (2:1 Ratio) |
| Activation Function | SwiGLU | SwiGLU | SwiGLU | GeGLU |
| RMSNorm Placement | Pre-layer norm only | Pre-layer norm only | Pre-layer norm only | Pre- & Post-layer norm (Double) |
| Specialized Quirks | None | None | None | Embedding scaling ($\sqrt{d}$), Soft-Capping |
| Tokenizer Type | Character-level | SentencePiece BPE | SentencePiece BPE | SentencePiece BPE |
| Primary Debug Target | Core matrix mult & layout | `byte_fallback` decode | KV-cache alignment | Gemma 2 advanced execution graph |
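
The head ratios in the table determine how query heads share key/value heads. The sketch below illustrates the usual contiguous grouping, in which each KV head serves a consecutive group of query heads; the function name `kv_head_for_query_head` is hypothetical and is not taken from this repository.

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head reads from.

    Assumes contiguous grouping: query heads are split into n_kv_heads
    consecutive groups of equal size, one KV head per group.
    """
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    if not 0 <= q_head < n_heads:
        raise ValueError("q_head out of range")
    group_size = n_heads // n_kv_heads
    return q_head // group_size


# Head layouts listed in the comparison table.
layouts = {
    "tiny1m (MHA, 2/2)": (2, 2),
    "tinymqa1m (MQA, 4/1)": (4, 1),
    "tinygemma1m (GQA, 2/1)": (2, 1),
}
for name, (n_heads, n_kv_heads) in layouts.items():
    mapping = [kv_head_for_query_head(h, n_heads, n_kv_heads) for h in range(n_heads)]
    print(name, mapping)
```

With a single KV head, every query head maps to index 0, so an engine that indexes the KV cache by query head instead of by group will read out of bounds or reuse the wrong slice.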

💡 Why validate with `tinygemma1m`?

Compared to standard architectures like Llama 2, Gemma 2 adds several compute-graph features that commonly cause bugs in clean-room engine implementations: dual RMSNorm boundaries (a norm on both the layer input and the block output), 3-tensor GeGLU projections, attention and final-logit soft-capping, and GQA cache broadcasting.

This model exercises all of these kernels inside a lightweight 1M parameter footprint, making it easy to isolate numerical errors without the memory overhead or slow iteration of full-size production weights.
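
As an illustration of one of these operators, logit soft-capping squashes a value smoothly into the range `(-cap, cap)` using `cap * tanh(x / cap)`. The following is a minimal scalar sketch using the thresholds listed in the model specifications (50.0 for attention, 30.0 for final logits); the helper name `soft_cap` is an assumption for illustration, and a real engine would apply it element-wise to whole tensors.

```python
import math


def soft_cap(x: float, cap: float) -> float:
    """Soft-cap a logit: near-identity for |x| << cap, saturating toward +/-cap."""
    return cap * math.tanh(x / cap)


ATTN_CAP = 50.0   # attention-score threshold from the model specifications
FINAL_CAP = 30.0  # final-logit threshold from the model specifications

for raw in (0.5, 25.0, 500.0):
    print(
        f"raw={raw:7.1f}  "
        f"attention={soft_cap(raw, ATTN_CAP):8.4f}  "
        f"final={soft_cap(raw, FINAL_CAP):8.4f}"
    )
```

Small inputs pass through almost unchanged while large ones saturate just below the threshold, so an engine that skips this step may still look correct on small logits and only diverge on large ones.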

📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory `./`)

A comprehensive binary suite built for llama.cpp and compatible runtime layers. To circumvent hardcoded string behaviors inside upstream parsers, these files have been explicitly binary-patched to restore text-mapping parameters and prefix logic correctly:

| Filename | Type | Size | Purpose / Validation Target |
|---|---|---|---|
| `tinygemma1m.F32.gguf` | `F32` | ~4.0 MB | Baseline Test. Validates raw Gemma 2 execution graph topology, matrix dimensions, and RoPE indexing without quantization noise. |
| `tinygemma1m.F16.gguf`, `tinygemma1m.BF16.gguf` | `F16`, `BF16` | ~2.0 MB | Half-Precision Test. Validates 16-bit float parsing, tensor execution boundaries, and compilation stability. |
| `tinygemma1m.Q8_0.gguf` | `Q8_0` | ~1.1 MB | Uniform Quantization. Validates block-based uniform scaling over 32-element blocks under Gemma 2 dimensions. |
| `tinygemma1m.Q4_0.gguf`, `tinygemma1m.Q4_1.gguf` | `Q4_0`, `Q4_1` | ~0.7 MB | Classic Quantization. Validates classic 4-bit linear quantization schemes and unpacking layouts. |
| `tinygemma1m.Q2_K.gguf` | `Q2_K` | ~0.5 MB | Standard K-Quant (2-bit). Validates extreme 2-bit super-block dequantization loops. |
| `tinygemma1m.Q3_K_M.gguf` | `Q3_K_M` | ~0.6 MB | Standard K-Quant (3-bit). Validates medium sub-variant of 3-bit multi-block structures. |
| `tinygemma1m.Q4_K_M.gguf` | `Q4_K_M` | ~0.7 MB | Standard K-Quant (4-bit). Validates medium sub-variant of modern 4-bit super-block structures. |
| `tinygemma1m.Q5_K_M.gguf` | `Q5_K_M` | ~0.8 MB | Standard K-Quant (5-bit). Validates medium sub-variant of mixed 5-bit precision layouts. |
| `tinygemma1m.Q6_K.gguf` | `Q6_K` | ~0.9 MB | Standard K-Quant (6-bit). Validates high-fidelity 6-bit super-block implementations. |
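
To make the `Q8_0` row concrete, the sketch below quantizes one 32-element block to a single scale plus 32 signed 8-bit values and then restores it. It is a simplified illustration, not the llama.cpp implementation: the real format stores the scale as a 16-bit float, whereas this version keeps a Python float, and the function names are hypothetical.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 elements


def quantize_q8_0_block(values):
    """Quantize one block of 32 floats into (scale, list of int8-range ints)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    # Simplification: Python's round() is used here; exact tie-breaking may
    # differ from a reference implementation.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Restore approximate floats from one quantized block."""
    return [scale * q for q in quants]


block = [(i - 16) / 10.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

The round-trip error per element is bounded by half the scale, which gives a useful tolerance when comparing a custom dequantization loop against the `F32` baseline.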

2. Hugging Face Native Format (`./hf/`)

Unquantized weights and configuration files for the PyTorch `transformers` library:

- `hf/model.safetensors`: Raw, unquantized weight tensors in the Gemma 2 layer layout.
- `hf/config.json`: Model settings following `Gemma2Config` (layer count, soft-capping thresholds, head counts).
- `hf/generation_config.json`: Default sampling settings.
- `hf/tokenizer.model`: The custom SentencePiece BPE model file with a 512-token vocabulary.
- `hf/tokenizer_config.json`: Metadata that selects `LlamaTokenizer` so that sequences are processed cleanly and `<s>` (BOS) is injected automatically on the PyTorch backend.
- `hf/special_tokens_map.json`: Maps special-token strings (`<s>` = 1, `</s>` = 2) to their token IDs.

🚀 Usage Examples

A. Running GGUF via llama.cpp

To verify your local runtime or inspect token generation with the Gemma 2 architecture:

./llama-cli -m tinygemma1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
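
Before pointing an engine at one of the files, it can help to confirm that the header parses. The sketch below reads only the fixed-size part of a GGUF header, assuming GGUF version 2 or later (64-bit counts) and little-endian byte order. The function name `read_gguf_header` is hypothetical, and the demo uses a synthetic header with made-up counts rather than values from this repository's files.

```python
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Read the magic, version, tensor count, and metadata key/value count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        fixed = f.read(20)  # uint32 version + uint64 tensor count + uint64 kv count
        if len(fixed) != 20:
            raise ValueError("truncated GGUF header")
        version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", fixed)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Demo on a synthetic header; the counts are placeholders for illustration only.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(GGUF_MAGIC + struct.pack("<IQQ", 3, 5, 7))
demo_header = read_gguf_header(tmp.name)
os.remove(tmp.name)
print(demo_header)
```

To inspect a real file, call `read_gguf_header("tinygemma1m.Q4_K_M.gguf")` after downloading it; the metadata entries and tensor descriptors follow the fixed header and are not parsed here.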

B. Loading Hugging Face Formats via Python

Because the tokenizer and model configurations match the underlying vocabulary layout, you can load both components directly with the standard `Auto*` classes; no manual wrapper logic is needed.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tinygemma1m"

print("Loading tokenizer and model configuration...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Tom and Jerry are "
# Text tokenization and automatic <s> (BOS) injection are managed via config metadata
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Executing inference loop (Validating Gemma 2 projection tensors)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_length=64, 
        do_sample=False
    )
    
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)
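
When using the Hugging Face output as a reference for a custom engine, comparing raw logits is usually more informative than comparing generated text. The helper below is a minimal sketch of such a check on plain Python lists; the name `compare_logits` and the default tolerance are assumptions for illustration, and quantized files should be expected to need a looser tolerance than `F32`.

```python
def compare_logits(reference, candidate, atol=1e-3):
    """Compare two logit vectors for one token position.

    Returns the largest absolute difference, whether both vectors pick the
    same top token, and whether the difference is within `atol`.
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    ref_top = max(range(len(reference)), key=reference.__getitem__)
    cand_top = max(range(len(candidate)), key=candidate.__getitem__)
    return {
        "max_abs_diff": max_abs_diff,
        "same_argmax": ref_top == cand_top,
        "within_tolerance": max_abs_diff <= atol,
    }


# Toy vectors standing in for one row of reference and engine logits.
reference_logits = [0.1, 2.0, -1.0]
engine_logits = [0.1004, 1.9995, -1.0]
report = compare_logits(reference_logits, engine_logits)
print(report)
```

In practice the reference row could come from `model(**inputs).logits[0, -1].tolist()` in the script above, with the candidate row exported from the engine under test.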

📝 Model Specifications

Architecture: Gemma 2 (Gemma2ForCausalLM)
Dataset: TinyStories
Total Parameters: ~1M
Vocabulary Size (vocab_size): 512 (Custom SentencePiece BPE with byte_fallback enabled)
Hidden Size (hidden_size): 128
Number of Hidden Layers (num_hidden_layers): 3
Number of Attention Heads (num_attention_heads): 2 (head_dim = 64)
Number of Key-Value Heads (num_key_value_heads): 1 (GQA Ratio = 2:1)
Intermediate Size (intermediate_size): 352
Max Position Embeddings (max_position_embeddings): 256
Sliding Window Size: 256
Logit Soft-Capping Thresholds: Attention=50.0, Final=30.0
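
The specifications above can be cross-checked with a quick parameter count. The sketch below assumes the usual Gemma 2 layout: tied input and output embeddings, bias-free projections, four RMSNorm weight vectors per layer, and one final norm. These are assumptions about the checkpoint, not values read from it, and the function name is hypothetical.

```python
def gemma2_param_count(
    vocab_size,
    hidden_size,
    num_layers,
    num_heads,
    num_kv_heads,
    head_dim,
    intermediate_size,
    tie_embeddings=True,
):
    """Estimate the parameter count of a Gemma 2 style decoder."""
    embed = vocab_size * hidden_size
    attn = (
        hidden_size * num_heads * head_dim            # q_proj
        + 2 * hidden_size * num_kv_heads * head_dim   # k_proj and v_proj
        + num_heads * head_dim * hidden_size          # o_proj
    )
    mlp = 3 * hidden_size * intermediate_size         # gate, up, and down projections
    norms = 4 * hidden_size                           # pre/post norms around attention and MLP
    total = embed + num_layers * (attn + mlp + norms) + hidden_size  # + final norm
    if not tie_embeddings:
        total += embed                                # separate output projection
    return total


total_params = gemma2_param_count(
    vocab_size=512,
    hidden_size=128,
    num_layers=3,
    num_heads=2,
    num_kv_heads=1,
    head_dim=64,
    intermediate_size=352,
)
print(total_params)  # 620160
```

Under these assumptions the count is 620,160, which is consistent with the 620k parameter figure reported for the GGUF files; the "~1M" label is therefore best read as a size class rather than an exact count.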

📜 Acknowledgments & License

Original Implementation: Heavily inspired by elements of the llama2.c project.
Dataset: TinyStories dataset.
License: MIT License. You are free to copy, modify, distribute, and utilize these assets for any commercial or educational goals.

GGUF Metadata: Model size 620k params, Architecture gemma2