Instructions to use shibatch/tinymoe2m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use shibatch/tinymoe2m with Transformers:

# Load model directly (the Transformers files live in the repo's hf/ subfolder)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("shibatch/tinymoe2m", subfolder="hf", dtype="auto")

llama-cpp-python

How to use shibatch/tinymoe2m with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="shibatch/tinymoe2m",
	filename="tinymoe2m.BF16.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Local Apps Settings

llama.cpp

How to use shibatch/tinymoe2m with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf shibatch/tinymoe2m:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf shibatch/tinymoe2m:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf shibatch/tinymoe2m:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf shibatch/tinymoe2m:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf shibatch/tinymoe2m:Q4_K_M

Use Docker

docker model run hf.co/shibatch/tinymoe2m:Q4_K_M

Ollama
How to use shibatch/tinymoe2m with Ollama:
```
ollama run hf.co/shibatch/tinymoe2m:Q4_K_M
```

Unsloth Studio

How to use shibatch/tinymoe2m with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tinymoe2m to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tinymoe2m to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for shibatch/tinymoe2m to start chatting

Docker Model Runner
How to use shibatch/tinymoe2m with Docker Model Runner:
```
docker model run hf.co/shibatch/tinymoe2m:Q4_K_M
```

Lemonade

How to use shibatch/tinymoe2m with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull shibatch/tinymoe2m:Q4_K_M

Run and chat with the model

lemonade run user.tinymoe2m-Q4_K_M

List all available models

lemonade list

tinymoe2m / README.md

shibatch

Upload README.md with huggingface_hub

ba5cd11 verified about 7 hours ago


	---
	license: mit
	base_model: mistralai/Mixtral-8x7B-v0.1
	tags:
	- mixtral
	- moe
	- gguf
	- safetensors
	- transformers
	- validation
	- test-suite
	---

	# TinyStories Mixtral 2M Top-2 MoE (tinymoe2m) GGUF & HF Validation Suite (4k Context)

	This repository provides an ultra-lightweight Mixtral model variant (a Mixture-of-Experts architecture built on the Llama 2 compute topology) scaled down to ~1.95M total parameters, of which ~1.14M are active per token. It is trained on the TinyStories dataset and intended as a precise validation asset.

	This asset is calibrated to a 4,096 token context window (4k) with an adjusted RoPE base frequency (`rope_theta`) of 15,000.0 to maintain sharp localized attention coordinates.
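To make the effect of the RoPE base concrete, here is a minimal sketch of the standard inverse-frequency formula, `theta^(-2i/head_dim)`, evaluated with this model's `rope_theta`. It assumes `head_dim = hidden_size / num_attention_heads = 64`; the helper name `rope_inv_freq` is illustrative and not part of any library.

```python
import math

def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Return the RoPE inverse frequencies theta^(-2i/head_dim) for i in [0, head_dim/2)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

# Assumption: head_dim = hidden_size / num_attention_heads = 128 / 2 = 64.
HEAD_DIM = 128 // 2
inv_freq = rope_inv_freq(HEAD_DIM, theta=15000.0)

# Wavelength (in tokens) of the fastest and the slowest rotating dimension pair.
shortest = 2 * math.pi / inv_freq[0]
longest = 2 * math.pi / inv_freq[-1]
print(f"pairs={len(inv_freq)} shortest={shortest:.2f} longest={longest:.0f}")
```

A custom engine can compare its own frequency table against these values before checking any logits, which narrows down phase errors early.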

	It is designed specifically for debugging custom inference engines and native tensor compilers against MoE-specific runtime features: gating-network weight computation, token distribution and gathering (scatter/gather loops), and the weighted sum that combines the outputs of the selected experts.
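As a rough illustration of the routing path described above, the sketch below routes a single token through a top-2 gate and combines the selected experts' outputs. It assumes the common ordering of softmax, then top-k selection, then renormalisation of the selected weights; a given engine may order these steps differently. The experts are hypothetical stand-ins, not the model's real SwiGLU blocks.

```python
import math

def top2_route(router_logits: list[float], k: int = 2) -> tuple[list[int], list[float]]:
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the selected experts' outputs for a single token."""
    chosen, weights = top2_route(router_logits)
    out = [0.0] * len(x)
    for idx, w in zip(chosen, weights):
        y = experts[idx](x)  # only the selected experts run
        out = [o + w * v for o, v in zip(out, y)]
    return out, chosen, weights

# Hypothetical stand-in experts: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(4)]
out, chosen, weights = moe_forward([1.0, -2.0], [1.0, 3.0, 2.0, 0.0], experts)
print(chosen, [round(w, 4) for w in weights], [round(o, 4) for o in out])
```

A batched implementation would group tokens by selected expert (the scatter step) and write results back to their original positions (the gather step); the per-token arithmetic stays the same.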

	---

	## 📊 Comparison: `tinymoe2m` vs Other 1M Variants

	To help track feature coverage across the 1M/2M verification suite, the core structural layouts are outlined below:

	| Feature / Metric | `tiny1m` (Standard) | `tinybpe1m` (BPE Variant) | `tinygemma1m` (Gemma 2 Variant) | `tinymoe2m` (This Repository) |
	| :--- | :--- | :--- | :--- | :--- |
	| Base Architecture | Llama 2 | Llama 2 | Gemma 2 | Llama 2 (Mixtral Format) |
	| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts (MoE) |
	| Attention Mechanism | MHA (Multi-Head) | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) |
	| Total Experts | 1 (Non-MoE) | 1 (Non-MoE) | 1 (Non-MoE) | 4 Experts |
	| Selected Experts | - | - | - | Top-2 Experts |
	| Expert FFN Dim (`intermediate_size`) | 564 | 352 | 352 | 352 (same for every expert) |
	| Max Position Embeddings | - | - | - | 4,096 |
	| RoPE Base (`rope_theta`) | - | - | - | 15,000.0 |
	| Total Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.95M |
	| Active Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.14M |
	| Primary Debug Target | Core matrix mult & layout | `byte_fallback` decode | Gemma 2 advanced graph | Dynamic Routing & Scatter/Gather |

	### 💡 Compute Cost vs Capacity Optimization
	With approximately 1.95M total parameters, this model has roughly twice the capacity of the standard 1M dense variants, which helps it keep a stable command of grammar and coherent phrasing from the TinyStories corpus. Because only the top-2 of the 4 experts run for each token, the active parameter count is capped at ~1.14M.
	This layout reproduces the fundamental benefit of MoE architectures: total capacity grows by about 2x, while per-token floating-point operations (FLOPs) rise to only about 1.1x–1.2x those of a 1M dense counterpart.

	---

	## 📂 Repository Structure & File Descriptions

	### 1. GGUF Formats (Root Directory `./`)
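	Binary files for `llama.cpp` or compatible low-level inference engines. llama.cpp's converter normally writes Mixtral-style checkpoints under the `llama` architecture, with the expert layout carried by the `llama.expert_count` and `llama.expert_used_count` metadata keys, so loaders are expected to select the MoE path from that metadata rather than from a separate architecture name.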

	| Filename | Type | Size | Target / Validation Focus |
	| :--- | :--- | :--- | :--- |
	| `tinymoe2m.F32.gguf` | `F32` | ~8.0 MB | Baseline Test. Eliminates quantization noise to isolate and verify the raw probability mathematics of the Gating network and expert tensor synthesis. |
	| `tinymoe2m.F16.gguf`<br>`tinymoe2m.BF16.gguf` | `F16`<br>`BF16` | ~4.0 MB | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and stability under parallelized accumulation layers. |
	| `tinymoe2m.Q8_0.gguf` | `Q8_0` | ~2.2 MB | Standard Quantization. Verifies block-based uniform scaling (32-element blocks) across decentralized MoE structures. |
	| `tinymoe2m.Q4_0.gguf`<br>`tinymoe2m.Q4_1.gguf` | `Q4_0`<br>`Q4_1` | ~1.4 MB | Classic Quantization. Tests 4-bit linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
	| `tinymoe2m.Q2_K.gguf` | `Q2_K` | ~1.1 MB | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
	| `tinymoe2m.Q3_K_M.gguf` | `Q3_K_M` | ~1.2 MB | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
	| `tinymoe2m.Q4_K_M.gguf` | `Q4_K_M` | ~1.4 MB | Standard K-Quant (4-bit). The baseline testing target for modern 4-bit super-block logic coupled with MoE paths. |
	| `tinymoe2m.Q5_K_M.gguf` | `Q5_K_M` | ~1.5 MB | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
	| `tinymoe2m.Q6_K.gguf` | `Q6_K` | ~1.7 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
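The "block-based uniform scaling" that the `Q8_0` row refers to can be sketched as follows. This is a simplified illustration, not the ggml implementation: it keeps the per-block scale as a Python float (the real format stores it in half precision) and uses Python's `round`, and the function names are hypothetical.

```python
BLOCK = 32  # Q8_0 block length, as stated in the table above

def q8_0_quantize(block: list[float]) -> tuple[float, list[int]]:
    """Quantize one 32-element block to a scale plus signed 8-bit integers."""
    assert len(block) == BLOCK
    amax = max(abs(v) for v in block)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK  # all-zero block: nothing to scale
    return scale, [round(v / scale) for v in block]

def q8_0_dequantize(scale: float, quants: list[int]) -> list[float]:
    """Reconstruct approximate weights from the scale and the integers."""
    return [scale * q for q in quants]

# Synthetic weights, used only to exercise the round trip.
weights = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK)]
scale, quants = q8_0_quantize(weights)
restored = q8_0_dequantize(scale, quants)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} max_err={max_err:.6f}")
```

The round-trip error per element is bounded by half the block scale, which is a handy sanity check when validating a dequantization kernel against the `F32` baseline.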

	### 2. Hugging Face Native Format (`./hf/`)
	Unquantized components formatted for direct instantiation inside the PyTorch `transformers` library ecosystem:
	* `hf/model.safetensors`: Raw unquantized matrix parameters containing all 4 expert sub-networks alongside the master router tensor.
	* `hf/config.json`: Architectural specifications built around `MixtralConfig` criteria (layer depth, head maps, absolute expert counts, and top-k selection targets). Fully updated to enforce `max_position_embeddings: 4096` and `rope_theta: 15000.0`.
	* `hf/generation_config.json`: Standard generation defaults.
	* `hf/tokenizer.model`: The custom SentencePiece BPE model binary with a 512-token vocabulary.
	* `hf/tokenizer_config.json`: Metadata that selects the `LlamaTokenizer` classes so that prefix spacing and automatic `<s>` (BOS) injection are handled correctly on the Hugging Face backend. Configured with `model_max_length: 4096`.
	* `hf/special_tokens_map.json`: Maps the special token strings (`<s>` = 1, `</s>` = 2) to their token IDs.

	---

	## 🎯 Purpose & Design Philosophy (Verification Targets)

	This checkpoint is specifically engineered as a deterministic validation test asset for computing platforms and is not designed for long-context semantic extraction tasks (such as Needle-in-a-Haystack password retrieval).

	Because of its extreme capacity limits (~1.95M total parameters) and ultra-compact vocabulary (512 tokens), the network spends its expressiveness almost entirely on English syntax and high-frequency phrases. It lacks the multi-layer induction (copy) circuits needed to retrieve arbitrary injected strings or individual characters across a long context window.

	### Expected Token Output Behavior
	When processed with template phrases containing temporary password identifiers like:
	`"The magic password of the giant was key X. I remember that the magic password of the giant was"`

	The network will not copy the literal character `X`; instead it continues with learned, unigram-biased phrases such as `"about to go home. Every day..."`. This is expected behavior for a model of this size. Validation is achieved strictly via bit-exact logit verification across runtime backends, which confirms matching compute kernels, KV cache memory indices, causal attention layers, and precise RoPE phase calculation.
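One way to perform such a bit-exact check is sketched below. It assumes each backend can dump its logits for the same prompt position as a list of floats; the helper names and the sample values are hypothetical.

```python
import struct

def f32_bits(value: float) -> int:
    """Return the IEEE-754 float32 bit pattern of a value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def compare_logits(reference: list[float], candidate: list[float]) -> dict:
    """Compare two logit vectors bit-for-bit and by maximum absolute difference."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors differ in length")
    mismatches = [i for i, (a, b) in enumerate(zip(reference, candidate))
                  if f32_bits(a) != f32_bits(b)]
    max_abs = max((abs(a - b) for a, b in zip(reference, candidate)), default=0.0)
    return {"bit_exact": not mismatches, "mismatches": mismatches, "max_abs_diff": max_abs}

# Hypothetical logits from two backends for the same prompt position.
backend_a = [0.125, -1.5, 3.0, 0.0]
backend_b = [0.125, -1.5, 3.0000005, 0.0]
report = compare_logits(backend_a, backend_b)
print(report["bit_exact"], report["mismatches"])
```

Comparing bit patterns rather than using a tolerance also flags differences such as `0.0` versus `-0.0`; the reported `max_abs_diff` helps judge whether a mismatch is accumulation-order noise or a real kernel bug.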

	---

	## 🚀 Usage Examples

	### A. Running GGUF via llama.cpp
	To run the MoE execution graph and evaluate dynamic expert routing from your shell:
	```bash
	./llama-cli -m tinymoe2m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0

	```

	### B. Loading Hugging Face Formats via Python

	Because the configuration matches the custom vocabulary, the standard `Auto*` loaders work without any custom wrapper code. Pass `subfolder="hf"`, since the Transformers files live under `./hf/`.

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	repo_id = "shibatch/tinymoe2m"

	print("Loading MoE configuration and tokenizer layers...")
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
	model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

	device = "cuda" if torch.cuda.is_available() else "cpu"
	model = model.to(device)
	model.eval()

	prompt = "Tom and Jerry are "
	inputs = tokenizer(prompt, return_tensors="pt").to(device)

	print("Running inference loop (Validating Top-2 sparse routing matrices)...")
	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,
	        max_length=64,
	        do_sample=False,
	    )

	generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

	print("\n--- Inference Test Result ---")
	print("Prompt :", prompt)
	print("Generated:", generated_text)

	```

	---

	## 📝 Model Specifications

	* Architecture: Mixtral (`MixtralForCausalLM`)
	* Dataset: TinyStories
	* Total Parameters (`num_local_experts` = 4): ~1.95M
	* Active Parameters (`num_experts_per_tok` = 2): ~1.14M
	* Vocabulary Size (`vocab_size`): 512 (Custom SentencePiece BPE with `byte_fallback` enabled)
	* Hidden Size (`hidden_size`): 128
	* Number of Hidden Layers (`num_hidden_layers`): 3
	* Number of Attention Heads (`num_attention_heads` / `num_key_value_heads`): 2 / 2 (MHA layout)
	* Individual Expert Internal Dimension (`intermediate_size`): 352 (SwiGLU structure)
	* Max Position Embeddings (`max_position_embeddings`): 4,096
	* RoPE Base Frequency (`rope_theta`): 15,000.0
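The total and active parameter figures can be reproduced from the values listed above. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per layer plus a final norm, and an LM head that is not tied to the input embeddings; these assumptions are not stated in the specification, but they yield the quoted ~1.95M and ~1.14M.

```python
def mixtral_param_counts(vocab, hidden, layers, inter, experts, top_k, tied=False):
    """Return (total, active) parameter counts for a bias-free Mixtral-style stack."""
    attention = 4 * hidden * hidden              # q, k, v, o projections (MHA, no GQA)
    router = hidden * experts                    # gating matrix
    expert = 3 * hidden * inter                  # SwiGLU: gate, up, down projections
    norms = 2 * hidden                           # two RMSNorm weights per layer
    shared = attention + router + norms
    embed = vocab * hidden * (1 if tied else 2)  # input embeddings (+ LM head)
    final_norm = hidden
    total = layers * (shared + experts * expert) + embed + final_norm
    active = layers * (shared + top_k * expert) + embed + final_norm
    return total, active

total, active = mixtral_param_counts(
    vocab=512, hidden=128, layers=3, inter=352, experts=4, top_k=2
)
print(f"total={total:,} active={active:,}")  # total=1,952,128 active=1,141,120
```

The expert FFNs account for most of the total (540,672 of 606,976 parameters per layer), which is why halving the number of experts that run per token removes such a large share of the per-token compute.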

	## 📜 License

	* License: MIT License. You are completely free to duplicate, modify, distribute, and utilize these assets across any commercial, personal, or educational environments.