Instructions to use shibatch/tinymoeja2m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use shibatch/tinymoeja2m with Transformers:

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("shibatch/tinymoeja2m", subfolder="hf", dtype="auto")  # weights live under hf/

llama-cpp-python

How to use shibatch/tinymoeja2m with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="shibatch/tinymoeja2m",
	filename="tinymoeja2m.BF16.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Local Apps

llama.cpp

How to use shibatch/tinymoeja2m with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf shibatch/tinymoeja2m:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf shibatch/tinymoeja2m:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf shibatch/tinymoeja2m:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf shibatch/tinymoeja2m:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf shibatch/tinymoeja2m:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf shibatch/tinymoeja2m:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf shibatch/tinymoeja2m:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf shibatch/tinymoeja2m:Q4_K_M

Use Docker

docker model run hf.co/shibatch/tinymoeja2m:Q4_K_M

Ollama
How to use shibatch/tinymoeja2m with Ollama:
```
ollama run hf.co/shibatch/tinymoeja2m:Q4_K_M
```

Unsloth Studio

How to use shibatch/tinymoeja2m with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tinymoeja2m to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tinymoeja2m to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for shibatch/tinymoeja2m to start chatting

Docker Model Runner
How to use shibatch/tinymoeja2m with Docker Model Runner:
```
docker model run hf.co/shibatch/tinymoeja2m:Q4_K_M
```

Lemonade

How to use shibatch/tinymoeja2m with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull shibatch/tinymoeja2m:Q4_K_M

Run and chat with the model

lemonade run user.tinymoeja2m-Q4_K_M

List all available models

lemonade list

	---
	license: mit
	tags:
	- mixtral
	- moe
	- gguf
	- safetensors
	- transformers
	- validation
	- test-suite
	- japanese
	- scratch-trained
	---

	# TinyStories Mixtral 2M Top-2 MoE GQA Japanese Validation Suite (tinymoeja2m)

	This repository provides an ultra-lightweight, Japanese-specialized Mixtral variant scaled down to about 2.05M total parameters, of which about 1.14M are active per token. It is trained on 320k stories from the TinyStories dataset, translated into Japanese with Gemma 4.

	This asset is configured with a 2,048 token context window (2k) and a standard RoPE base frequency (`rope_theta`) of 10,000.0 to act as a clean, trick-free baseline validation asset for runtime implementations.
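
As a rough illustration of what `rope_theta` controls, the sketch below computes the RoPE inverse frequencies for one attention head and rotates a single pair of values. It assumes a head dimension of `hidden_size / num_attention_heads` = 32. The helper names `rope_inv_freq` and `rotate_pair` are hypothetical, and which elements of a head are paired together depends on the engine's layout convention.

```python
import math
from typing import List, Tuple

def rope_inv_freq(head_dim: int, theta: float) -> List[float]:
    """Inverse frequencies theta^(-2i/head_dim) for i = 0 .. head_dim/2 - 1."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x0: float, x1: float, pos: int, inv_freq: float) -> Tuple[float, float]:
    """Rotate one (x0, x1) pair by the angle pos * inv_freq."""
    angle = pos * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

HEAD_DIM = 128 // 4      # assumed: hidden_size / num_attention_heads = 32
ROPE_THETA = 10000.0     # rope_theta from this card

inv_freq = rope_inv_freq(HEAD_DIM, ROPE_THETA)
print(len(inv_freq), inv_freq[0], inv_freq[-1])
# Rotation of the slowest-moving pair at the last position of the 2k window.
print(rotate_pair(1.0, 0.0, pos=2047, inv_freq=inv_freq[-1]))
```

An engine under test can compare its own frequency table against `inv_freq` before debugging the attention kernel itself.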

	It is designed specifically for debugging custom inference engines against the combination of Grouped-Query Attention (GQA) and a Mixture-of-Experts (MoE) feed-forward layout.

	---

	## 📊 Comparison: `tinymoeja2m` vs Other Variants

	To help track feature coverage across the verification suite, the updated structural layouts are outlined below:

	| Feature / Metric | `tiny1m` (Standard) | `tinygemma1m` (Gemma 2) | `tinymoe2m` (English 4k) | `tinymoeja2m` (This Repository) |
	| :--- | :--- | :--- | :--- | :--- |
	| Language | English | English | English | Japanese |
	| Base Architecture | Llama 2 | Gemma 2 | Llama 2 (Mixtral) | Llama 2 (Mixtral Format) |
	| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts | Mixture-of-Experts (MoE) |
	| Attention Mechanism | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) | GQA (Grouped-Query) |
	| Total / Selected Experts | 1 / - | 1 / - | 4 Experts / Top-2 | 4 Experts / Top-2 |
	| GQA Head Ratio (Q:KV) | 1:1 (MHA) | 4:1 (GQA) | 1:1 (MHA) | 4:1 (Query: 4, KV: 1) |
	| Max Position Embeddings | - | - | 4,096 | 2,048 (2k Context) |
	| RoPE Base (`rope_theta`) | - | - | 15,000.0 | 10,000.0 |
	| Total / Active Params | ~1.2M / ~1.2M | ~1.0M / ~1.0M | ~1.95M / ~1.14M | ~2.05M / ~1.14M |
	| Primary Debug Target | Core matrix mult | Advanced graph | Scatter/Gather loops | GQA Broadcast & Byte Fallback |
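
The "4 Experts / Top-2" row can be made concrete with a small routing sketch. It assumes Mixtral-style routing as implemented in common reference code: softmax over all router logits, keep the two largest, then renormalize the kept weights. The function names and the toy "experts" are hypothetical stand-ins, not code from this repository.

```python
import math
from typing import Callable, List, Tuple

def route_top_k(router_logits: List[float], top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) for the top_k experts of one token."""
    # Softmax over all experts (shifted by the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(v - m) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x: List[float], router_logits: List[float],
                experts: List[Callable[[List[float]], List[float]]],
                top_k: int = 2) -> List[float]:
    """Weighted sum of the selected experts' outputs; the other experts are skipped."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, top_k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Four toy "experts" that only scale their input (stand-ins for SwiGLU FFNs).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.0]
routes = route_top_k(logits)
print(routes)
print(moe_forward([1.0, -1.0], logits, experts))
```

Only two of the four experts run per token, which is why the active parameter count is lower than the total.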

	---

	## 📂 Repository Structure & File Descriptions

	### 1. GGUF Formats (Root Directory `./`)
	Binary files optimized for execution via `llama.cpp` or compatible lower-level inference engines. Upstream parsers automatically recognize this under the `mixed` (Mixtral) descriptor.

	| Filename | Type | Target / Validation Focus |
	| :--- | :--- | :--- |
	| `tinymoeja2m.F32.gguf` | `F32` | Baseline Test. Eliminates quantization noise to isolate and verify raw probability mathematics. |
	| `tinymoeja2m.F16.gguf`<br>`tinymoeja2m.BF16.gguf` | `F16`<br>`BF16` | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and parallelized accumulation layers. |
	| `tinymoeja2m.Q8_0.gguf` | `Q8_0` | Standard Quantization. Verifies block-based uniform scaling across decentralized MoE structures. |
	| `tinymoeja2m.Q4_0.gguf`<br>`tinymoeja2m.Q4_1.gguf` | `Q4_0`<br>`Q4_1` | Classic 4-bit Quantization. Tests basic linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
	| `tinymoeja2m.Q2_K.gguf` | `Q2_K` | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
	| `tinymoeja2m.Q3_K_M.gguf` | `Q3_K_M` | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
	| `tinymoeja2m.Q4_K_M.gguf` | `Q4_K_M` | Standard K-Quant (4-bit). Target for modern 4-bit super-block logic coupled with sparse MoE graphs. |
	| `tinymoeja2m.Q5_K_M.gguf` | `Q5_K_M` | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
	| `tinymoeja2m.Q6_K.gguf` | `Q6_K` | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
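
To show what "block-based uniform scaling" means for the `Q8_0` row, here is a minimal round-trip sketch. It assumes the usual Q8_0 layout of 32-value blocks, each stored as one half-precision scale plus 32 signed 8-bit integers. The function names are hypothetical, and the code works on plain Python lists rather than on the packed GGUF bytes.

```python
import math
import struct
from typing import List, Tuple

BLOCK = 32  # assumed Q8_0 block size: 32 values share one scale

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_away(v: float) -> int:
    """Round half away from zero, like C's roundf."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))

def quantize_q8_0(values: List[float]) -> List[Tuple[float, List[int]]]:
    """Quantize to (scale, int8 list) blocks; len(values) must be a multiple of 32."""
    assert len(values) % BLOCK == 0
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                      # one scale per block
        inv = 1.0 / d if d else 0.0
        qs = [round_away(v * inv) for v in chunk]
        blocks.append((to_fp16(d), qs))       # the scale is kept at fp16 precision
    return blocks

def dequantize_q8_0(blocks: List[Tuple[float, List[int]]]) -> List[float]:
    """Reconstruct approximate floats: value = scale * quantized integer."""
    return [d * q for d, qs in blocks for q in qs]

weights = [math.sin(0.37 * i) * 0.05 for i in range(64)]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks: {len(blocks)}, max abs round-trip error: {max_err:.6f}")
```

Comparing an engine's dequantized tensors against the `F32` file in the same way helps separate quantization error from kernel bugs.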

	### 2. Hugging Face Native Format (`./hf/`)
	Unquantized components formatted for direct instantiation inside the PyTorch `transformers` library ecosystem:
	* `hf/model.safetensors`: Raw unquantized matrix parameters containing all 4 expert sub-networks, GQA projection matrices, and the master router tensor.
	* `hf/config.json`: Architectural specifications built around `MixtralConfig`. Fully configured to enforce `num_attention_heads: 4`, `num_key_value_heads: 1`, `max_position_embeddings: 2048`, and `rope_theta: 10000.0`.
	* `hf/generation_config.json`: Standard generation defaults.
	* `hf/tokenizer.model`: The custom SentencePiece BPE model with a 1,024-token vocabulary, trained on a clean Japanese text subset with `byte_fallback` enabled.
	* `hf/tokenizer.json`: JSON-serialized token maps for fast interoperability with native tokenization backends.
	* `hf/tokenizer_config.json`: Metadata that selects the `LlamaTokenizer` class to guarantee correct handling of prefix spacing and automatic `<s>` (BOS) injection.
	* `hf/special_tokens_map.json`: Structural map linking special tokens (`<s>`=1, `</s>`=2, `<unk>`=0, `<pad>`=2).

	The checkpoint is provided at several precisions, one subfolder each:

	* `./hf/` : Float32 (FP32) Master Subfolder. The unquantized baseline precision weights. Highly recommended for initializing custom floating-point matrix operations without rounding loss.
	* `./hf.bf16/` : Bfloat16 (BF16) Subfolder. Optimized for modern hardware acceleration structures (such as Ampere/Hopper Tensor Cores or Intel Arc/Gaudi frames) to examine native 16-bit brain floating-point pipelines.
	* `./hf.fp16/` : Float16 (FP16) Subfolder. Ideal for standard 16-bit half-precision parallel math routines and performance evaluation profiles.
	* `./hf.fp64/` : Float64 (FP64 / Double) Subfolder. Retains ultra-high mathematical double precision parameters. Designed to strictly isolate hardware-level execution bugs from system accumulation errors.

	---

	## 🎯 Purpose & Design Philosophy (Verification Targets)

	This checkpoint is specifically engineered as a deterministic validation test asset for runtime computing backends and is not designed for practical semantic tasks.

	Due to the compact parameter size (~2.05M) and ultra-focused vocabulary layout (1,024 tokens), the network concentrates its capacity entirely on mastering Japanese phrase continuations and basic syntax under an autoregressive framework.

	### Critical Debugging Capabilities for Custom Engines:
	1. GQA Broadcast Matrix Multiplication
	The 4:1 Grouped-Query Attention structure requires the execution kernels to correctly share a single Key/Value cache block across 4 independent Query heads. This serves as an ideal testbed for tracking memory stride offsets and tensor broadcasting alignment in parallel computing shaders.
	2. Multi-Byte UTF-8 Byte Fallback Validation
	With the vocabulary limited to 1,024 tokens, any kanji or character outside the primary training subset triggers the `byte_fallback` mechanism, which breaks the character down into raw sequential UTF-8 byte tokens (typically 3 tokens per character, since most Japanese characters occupy 3 bytes in UTF-8). This stress-tests the engine's streaming decoder, which must stitch unaligned byte streams back into correct Japanese text without truncation or corruption.
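
The GQA sharing described in item 1 can be sketched in plain Python for a single decoding step. This is an illustrative reference under stated assumptions, not the kernel of any particular engine: the names `kv_head_for`, `attend`, and `gqa_step` are hypothetical, and the toy tensors use a head dimension of 2 instead of the real model's size.

```python
import math
from typing import List

N_HEADS, N_KV_HEADS = 4, 1          # 4:1 layout from this card
GROUP = N_HEADS // N_KV_HEADS       # query heads that share one KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head that a given query head reads from."""
    return q_head // GROUP

def attend(q: List[float], keys: List[List[float]], values: List[List[float]]) -> List[float]:
    """Single-query scaled dot-product attention over cached keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(values[0])
    return [sum(p * v[j] for p, v in zip(probs, values)) for j in range(dim)]

def gqa_step(q_heads: List[List[float]],
             k_cache: List[List[List[float]]],
             v_cache: List[List[List[float]]]) -> List[List[float]]:
    """q_heads: [N_HEADS][d]; k_cache and v_cache: [N_KV_HEADS][T][d]."""
    return [
        attend(q, k_cache[kv_head_for(h)], v_cache[kv_head_for(h)])
        for h, q in enumerate(q_heads)
    ]

# Toy data: head_dim 2, two cached positions, one shared KV head.
k_cache = [[[1.0, 0.0], [0.0, 1.0]]]
v_cache = [[[1.0, 2.0], [3.0, 4.0]]]
q_heads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
outputs = gqa_step(q_heads, k_cache, v_cache)
for h, out in enumerate(outputs):
    print(h, kv_head_for(h), out)
```

All four query heads index the same cache block, so no copy of the keys or values is needed. A kernel that materializes four copies should produce the same outputs.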

	---

	## 🚀 Usage Examples

	### A. Running GGUF via llama.cpp
	To process the GQA MoE execution graph and evaluate dynamic expert routing directly on your shell:
	```bash
	./llama-cli -m tinymoeja2m.Q4_K_M.gguf -p "トムとリリーは" -n 64 --temp 0.0
	```

	### B. Loading Hugging Face Formats via Python

	```python
	import torch
	import sentencepiece as spm
	from transformers import MixtralForCausalLM
	from huggingface_hub import hf_hub_download

	# Define target repository identity
	repo_id = "shibatch/tinymoeja2m"

	print("Downloading and caching specialized tokenizer layer...")
	# Fetch tokenizer.model file automatically from Hugging Face Hub
	tokenizer_file = hf_hub_download(repo_id=repo_id, subfolder="hf", filename="tokenizer.model")

	sp = spm.SentencePieceProcessor()
	sp.Load(tokenizer_file)

	print("Downloading and loading Mixtral-based 2M MoE model weights...")
	model = MixtralForCausalLM.from_pretrained(repo_id, subfolder="hf")

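	# torch.xpu is only present in recent PyTorch builds, so check for it before calling it
	device = "cuda" if torch.cuda.is_available() else ("xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu")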
	model = model.to(device)
	model.eval()

	# Prompt text utilizing vocabulary subsets
	prompt = "トムとリリーは"
	input_ids = [1] + sp.EncodeAsIds(prompt) # Explicitly prepend BOS (1)
	input_tensor = torch.tensor([input_ids]).to(device)

	print("Executing text generation loop (Validating 4:1 GQA & Top-2 MoE Kernels)...")
	with torch.no_grad():
	    output_ids = model.generate(
	        input_tensor,
	        max_length=64,
	        do_sample=False,
	        pad_token_id=2,
	        bos_token_id=1,
	        eos_token_id=2,
	    )

	generated_ids = output_ids[0].cpu().tolist()
	generated_text = sp.DecodeIds(generated_ids)

	print("\n--- Inference Test Result ---")
	print("Prompt :", prompt)
	print("Generated:", generated_text)
	```
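
The script above decodes the whole sequence at once with `sp.DecodeIds`. An engine that streams output token by token has to hold back byte-fallback tokens until a full UTF-8 sequence has arrived. The sketch below shows one way to do that, assuming byte pieces are spelled `<0xNN>` and that `▁` (U+2581) marks a space, as in SentencePiece. `stream_decode` is a hypothetical helper, and `猫` merely stands in for a character outside the vocabulary.

```python
import codecs
import re
from typing import Iterable, Iterator

BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")  # assumed byte-fallback piece format

def stream_decode(pieces: Iterable[str]) -> Iterator[str]:
    """Yield text as soon as it is complete; hold back partial UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            raw = bytes([int(match.group(1), 16)])
        else:
            # Ordinary piece: map the SentencePiece space marker to a real space.
            raw = piece.replace("\u2581", " ").encode("utf-8")
        text = decoder.decode(raw)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush; truncated input becomes U+FFFD
    if tail:
        yield tail

# Build the three byte pieces a fallback tokenizer would emit for one character.
fallback = [f"<0x{b:02X}>" for b in "猫".encode("utf-8")]
pieces = ["トム", "と"] + fallback + ["は"]
chunks = list(stream_decode(pieces))
print(fallback)
print(chunks)
```

Nothing is emitted for the first two byte pieces, and the character appears only once its third byte arrives. A decoder that prints each token independently would show replacement characters instead.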

	---

	## 📝 Model Specifications

	* Architecture: Mixtral (`MixtralForCausalLM`)
	* Dataset: TinyStories Japanese Translation Corpus (320k stories)
	* Total Parameters (`num_local_experts` = 4): ~2.05M
	* Active Parameters (`num_experts_per_tok` = 2): ~1.14M
	* Vocabulary Size (`vocab_size`): 1,024 (Custom SentencePiece BPE with `byte_fallback` enabled)
	* Hidden Size (`hidden_size`): 128
	* Number of Hidden Layers (`num_hidden_layers`): 3
	* Number of Attention Heads (`num_attention_heads` / `num_key_value_heads`): 4 / 1 (4:1 GQA layout)
	* Individual Expert Internal Dimension (`intermediate_size`): 352 (SwiGLU structure)
	* Max Position Embeddings (`max_position_embeddings`): 2,048
	* RoPE Base Frequency (`rope_theta`): 10,000.0
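
The total and active parameter counts can be approximated from the values in this list. The estimate below is a sketch under explicit assumptions: no bias terms, two RMSNorm weight vectors per layer plus a final norm, untied input and output embeddings, and a head dimension of `hidden_size / num_attention_heads`. If the checkpoint ties embeddings or counts active parameters differently, the result will differ somewhat from the rounded figures quoted in this card.

```python
# Values copied from the specification list in this card.
VOCAB, HIDDEN, LAYERS = 1024, 128, 3
N_HEADS, N_KV_HEADS = 4, 1
INTERMEDIATE, N_EXPERTS, TOP_K = 352, 4, 2

def count_params(experts_per_layer: int) -> int:
    """Estimate parameters when `experts_per_layer` experts are counted in each layer."""
    head_dim = HIDDEN // N_HEADS                    # assumed: 32
    attn = 2 * HIDDEN * HIDDEN                      # q_proj + o_proj
    attn += 2 * HIDDEN * N_KV_HEADS * head_dim      # k_proj + v_proj (single shared KV head)
    expert = 3 * HIDDEN * INTERMEDIATE              # SwiGLU: gate, up, and down projections
    router = HIDDEN * N_EXPERTS                     # router is always evaluated in full
    norms = 2 * HIDDEN                              # assumed: two RMSNorm weights per layer
    layer = attn + router + norms + experts_per_layer * expert
    embed = 2 * VOCAB * HIDDEN                      # assumed: untied input/output embeddings
    return LAYERS * layer + embed + HIDDEN          # plus the final norm

total = count_params(N_EXPERTS)
active = count_params(TOP_K)
print(f"total  ~ {total / 1e6:.2f}M ({total})")
print(f"active ~ {active / 1e6:.2f}M ({active})")
```

Loading `hf/model.safetensors` and summing the tensor sizes gives the exact figure for comparison.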

	## 📜 License

	* License: MIT License. You are free to copy, modify, distribute, and use these assets in commercial, personal, or educational settings, provided the copyright and license notice is retained.