---
license: mit
tags:
- mixtral
- moe
- gguf
- safetensors
- transformers
- validation
- test-suite
- japanese
- scratch-trained
---

# TinyStories Mixtral 2M Top-2 MoE GQA Japanese Validation Suite (tinymoeja2m)

This repository provides an ultra-lightweight, Japanese-specialized Mixtral variant scaled down to **2.05M total parameters**, of which **1.14M are active per token**. It is trained on 320k stories from the TinyStories dataset, translated into Japanese via Gemma 4.

This asset is configured with a **2,048 token context window (2k)** and a standard RoPE base frequency (`rope_theta`) of **10,000.0** to act as a clean, trick-free baseline validation asset for runtime implementations.
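
For runtime authors, the rotary frequencies implied by these settings can be sketched as follows. This is a minimal sketch, assuming the common RoPE formulation in which dimension pair `i` rotates at `rope_theta ** (-2i / head_dim)` and a head dimension of 32 (`hidden_size` 128 divided by 4 attention heads). The helper names are illustrative and not part of any library.

```python
import math

ROPE_THETA = 10000.0   # rope_theta from config.json
HEAD_DIM = 128 // 4    # assumption: hidden_size / num_attention_heads = 32

def rope_inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Inverse frequency for each of the head_dim/2 rotated dimension pairs."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_angles(position, head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Rotation angle in radians applied to each pair at a given token position."""
    return [position * f for f in rope_inv_freq(head_dim, theta)]

def rotate_pair(x0, x1, angle):
    """Apply a 2-D rotation to one (x0, x1) pair of a query or key vector."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

Which two elements form a pair (adjacent elements or the two halves of the head) differs between runtimes, so check the convention of the engine under test before comparing intermediate tensors.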

It is designed specifically for debugging custom inference engines against the combination of Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE) topologies.

---

## 📊 Comparison: `tinymoeja2m` vs Other Variants

To help track feature coverage across the verification suite, the updated structural layouts are outlined below:

| Feature / Metric | `tiny1m` (Standard) | `tinygemma1m` (Gemma 2) | `tinymoe2m` (English 4k) | `tinymoeja2m` (This Repository) |
| :--- | :--- | :--- | :--- | :--- |
| **Language** | English | English | English | **Japanese** |
| **Base Architecture** | Llama 2 | Gemma 2 | Llama 2 (Mixtral) | **Llama 2 (Mixtral Format)** |
| **FFN Structure** | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts | **Mixture-of-Experts (MoE)** |
| **Attention Mechanism** | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) | **GQA (Grouped-Query)** |
| **Total / Selected Experts**| 1 / - | 1 / - | 4 Experts / Top-2 | **4 Experts / Top-2** |
| **GQA Head Ratio (Q:KV)** | 1:1 (MHA) | 4:1 (GQA) | 1:1 (MHA) | **4:1 (Query: 4, KV: 1)** |
| **Max Position Embeddings** | - | - | 4,096 | **2,048 (2k Context)** |
| **RoPE Base (`rope_theta`)** | - | - | 15,000.0 | **10,000.0** |
| **Total / Active Params** | ~1.2M / ~1.2M | ~1.0M / ~1.0M | ~1.95M / ~1.14M | **~2.05M / ~1.14M** |
| **Primary Debug Target** | Core matrix mult | Advanced graph | Scatter/Gather loops | **GQA Broadcast & Byte Fallback** |
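
The "4 Experts / Top-2" row can be made concrete with a small routing sketch. It assumes the Mixtral-style scheme of taking a softmax over all router logits, keeping the two largest probabilities, and renormalizing them. `top2_route`, `moe_forward`, and the toy scalar experts are illustrative names, not part of this repository.

```python
import math

def top2_route(router_logits, top_k=2):
    """Return [(expert_index, weight), ...] for one token."""
    # Softmax over all expert logits (shifted by the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(v - m) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the selected experts' outputs; the other experts are never run."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))

# Toy stand-ins for the 4 expert FFNs: each one just scales its scalar input.
experts = [lambda x, k=k: x * (k + 1) for k in range(4)]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top2_route(logits)
y = moe_forward(1.0, logits, experts)
```

Comparing the selected expert indices and weights per token against a reference like this is usually a quicker way to localize a routing bug than comparing final logits.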

---

## 📂 Repository Structure & File Descriptions

### 1. GGUF Formats (Root Directory `./`)
Binary files optimized for execution via `llama.cpp` or compatible lower-level inference engines. Upstream parsers automatically recognize this under the `mixed` (Mixtral) descriptor.

| Filename | Type | Target / Validation Focus |
| :--- | :--- | :--- |
| **`tinymoeja2m.F32.gguf`** | `F32` | **Baseline Test.** Eliminates quantization noise to isolate and verify raw probability mathematics. |
| **`tinymoeja2m.F16.gguf`**<br>**`tinymoeja2m.BF16.gguf`** | `F16`<br>`BF16` | **Half-Precision Test.** Evaluates 16-bit floating-point unpacking routines and parallelized accumulation layers. |
| **`tinymoeja2m.Q8_0.gguf`** | `Q8_0` | **Standard Quantization.** Verifies block-based uniform scaling across decentralized MoE structures. |
| **`tinymoeja2m.Q4_0.gguf`**<br>**`tinymoeja2m.Q4_1.gguf`** | `Q4_0`<br>`Q4_1` | **Classic 4-bit Quantization.** Tests basic linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| **`tinymoeja2m.Q2_K.gguf`** | `Q2_K` | **Standard K-Quant (2-bit).** Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| **`tinymoeja2m.Q3_K_M.gguf`** | `Q3_K_M` | **Standard K-Quant (3-bit).** Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| **`tinymoeja2m.Q4_K_M.gguf`** | `Q4_K_M` | **Standard K-Quant (4-bit).** Target for modern 4-bit super-block logic coupled with sparse MoE graphs. |
| **`tinymoeja2m.Q5_K_M.gguf`** | `Q5_K_M` | **Standard K-Quant (5-bit).** Validates high-fidelity mixed 5-bit precision layouts. |
| **`tinymoeja2m.Q6_K.gguf`** | `Q6_K` | **Standard K-Quant (6-bit).** Validates 6-bit high-fidelity super-block dequantization. |
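
To illustrate the "block-based uniform scaling" that the `Q8_0` file exercises, here is a minimal sketch of one block's round trip. It assumes the ggml `Q8_0` layout of 32 values per block sharing a single half-precision scale. The function names are hypothetical, and the arithmetic runs in Python floats rather than 32-bit floats, so results may differ from a real engine in the last bit.

```python
import math
import struct

QK8_0 = 32  # values per Q8_0 block

def to_fp16(value):
    """Round a float to the nearest IEEE half-precision value (how the scale is stored)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def round_half_away(x):
    """Round to the nearest integer, ties away from zero (C roundf behaviour)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def quantize_q8_0_block(values):
    """Map 32 floats to one fp16 scale plus 32 integers in [-127, 127]."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    d = amax / 127.0
    inv_d = 1.0 / d if d else 0.0
    return to_fp16(d), [round_half_away(v * inv_d) for v in values]

def dequantize_q8_0_block(scale, quants):
    """Reconstruct approximate floats: every value in the block shares one scale."""
    return [scale * q for q in quants]
```

Running the same block through an engine's dequantizer and through this sketch gives a quick check that the scale and the signed 8-bit values are being unpacked in the right order.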

### 2. Hugging Face Native Format (`./hf/`)
Unquantized components formatted for direct instantiation inside the PyTorch `transformers` library ecosystem:
* **`hf/model.safetensors`**: Unquantized weights containing all 4 expert sub-networks, the GQA projection matrices, and the router (gate) tensors.
* **`hf/config.json`**: Architectural specifications built around `MixtralConfig`. Fully configured to enforce `num_attention_heads: 4`, `num_key_value_heads: 1`, `max_position_embeddings: 2048`, and `rope_theta: 10000.0`.
* **`hf/generation_config.json`**: Standard generation defaults.
* **`hf/tokenizer.model`**: The custom SentencePiece BPE model with a 1,024-token vocabulary, trained on a clean Japanese text subset with `byte_fallback` enabled.
* **`hf/tokenizer.json`**: JSON-serialized tokenizer definition for fast interoperability with native tokenization backends.
* **`hf/tokenizer_config.json`**: Metadata that selects the `LlamaTokenizer` class so that prefix spacing and automatic `<s>` (BOS) injection are handled correctly.
* **`hf/special_tokens_map.json`**: Structural map linking special tokens (`<s>`=1, `</s>`=2, `<unk>`=0, `<pad>`=2).

The weights are provided at several precisions, one subfolder each:

* **`./hf/`** : **Float32 (FP32) Master Subfolder.** The unquantized baseline weights. Recommended for initializing custom floating-point matrix operations without rounding loss.
* **`./hf.bf16/`** : **Bfloat16 (BF16) Subfolder.** Intended for hardware with native BF16 support (such as Ampere/Hopper Tensor Cores or Intel Arc/Gaudi) to exercise 16-bit brain floating-point pipelines.
* **`./hf.fp16/`** : **Float16 (FP16) Subfolder.** Suited to standard 16-bit half-precision math routines and performance profiling.
* **`./hf.fp64/`** : **Float64 (FP64 / Double) Subfolder.** Stores the parameters in double precision. Intended for separating hardware-level execution bugs from floating-point accumulation error.

---

## 🎯 Purpose & Design Philosophy (Verification Targets)

This checkpoint is specifically engineered as a deterministic validation test asset for runtime computing backends and **is not designed for practical semantic tasks.**

Because of its small parameter count (~2.05M) and 1,024-token vocabulary, the network spends its capacity on Japanese phrase continuation and basic syntax in an autoregressive setting.

### Critical Debugging Capabilities for Custom Engines:
1. **GQA Broadcast Matrix Multiplication**
   The 4:1 Grouped-Query Attention structure requires the execution kernels to correctly share a single Key/Value cache block across 4 independent Query heads. This serves as an ideal testbed for tracking memory stride offsets and tensor broadcasting alignment in parallel computing shaders.
2. **Multi-Byte UTF-8 Byte Fallback Validation**
   With the vocabulary limited to 1,024 tokens, any kanji or other character outside the vocabulary triggers the `byte_fallback` mechanism, which breaks the character into raw UTF-8 byte tokens (typically 3 tokens per character, since most Japanese characters occupy 3 bytes in UTF-8). This stress-tests the engine's streaming decoder, which must reassemble byte sequences that may be split across generation steps into valid Japanese text without truncation or corruption.
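
The byte-fallback case can be sketched with a small streaming detokenizer. This is a minimal sketch, assuming byte tokens surface as SentencePiece-style pieces of the form `<0xE7>` and that `▁` (U+2581) marks a space. The class name is hypothetical, and the handling of the leading space is simplified compared with SentencePiece itself.

```python
import codecs
import re

BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

class StreamingDetokenizer:
    """Turns a stream of pieces into text, holding back incomplete UTF-8 sequences."""

    def __init__(self):
        # The incremental decoder buffers partial multi-byte sequences between calls.
        self._utf8 = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def feed(self, piece):
        """Return the text that becomes printable after receiving this piece."""
        m = BYTE_PIECE.match(piece)
        if m:
            # Byte-fallback piece: text is emitted only once the character is complete.
            return self._utf8.decode(bytes([int(m.group(1), 16)]))
        # Ordinary piece: flush any dangling bytes first, then map the marker to a space.
        return self._utf8.decode(b"", final=True) + piece.replace("\u2581", " ")

    def finish(self):
        """Flush at end of generation; a truncated character becomes U+FFFD."""
        return self._utf8.decode(b"", final=True)

# Example: a kanji arriving as three separate byte tokens between ordinary pieces.
detok = StreamingDetokenizer()
pieces = ["トム"] + ["<0x%02X>" % b for b in "猫".encode("utf-8")] + ["と"]
chunks = [detok.feed(p) for p in pieces]
```

The point of the design is that nothing is printed for the first two byte tokens. An engine that decodes each token to text independently will emit replacement characters here instead of the kanji.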

---

## 🚀 Usage Examples

### A. Running GGUF via llama.cpp
To run the GQA MoE execution graph and exercise dynamic expert routing from your shell:
```bash
./llama-cli -m tinymoeja2m.Q4_K_M.gguf -p "トムとリリーは" -n 64 --temp 0.0
```

### B. Loading Hugging Face Formats via Python

```python
import torch
import sentencepiece as spm
from transformers import MixtralForCausalLM
from huggingface_hub import hf_hub_download

# Define target repository identity
repo_id = "shibatch/tinymoeja2m"

print("Downloading and caching specialized tokenizer layer...")
# Fetch tokenizer.model file automatically from Hugging Face Hub
tokenizer_file = hf_hub_download(repo_id=repo_id, subfolder="hf", filename="tokenizer.model")

sp = spm.SentencePieceProcessor()
sp.Load(tokenizer_file)

print("Downloading and loading Mixtral-based 2M MoE model weights...")
model = MixtralForCausalLM.from_pretrained(repo_id, subfolder="hf")

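# Pick an accelerator if one is present; torch.xpu is missing in older PyTorch builds.
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
else:
    device = "cpu"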
model = model.to(device)
model.eval()

# Prompt text utilizing vocabulary subsets
prompt = "トムとリリーは"
input_ids = [1] + sp.EncodeAsIds(prompt) # Explicitly prepend BOS (1)
input_tensor = torch.tensor([input_ids]).to(device)

print("Executing text generation loop (Validating 4:1 GQA & Top-2 MoE Kernels)...")
with torch.no_grad():
    output_ids = model.generate(
        input_tensor,
        max_length=64,
        do_sample=False,
        pad_token_id=2,
        bos_token_id=1,
        eos_token_id=2
    )

generated_ids = output_ids[0].cpu().tolist()
generated_text = sp.DecodeIds(generated_ids)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)
```

---

## 📝 Model Specifications

* **Architecture:** Mixtral (`MixtralForCausalLM`)
* **Dataset:** TinyStories Japanese Translation Corpus (320k stories)
* **Total Parameters (`num_local_experts` = 4):** ~2.05M
* **Active Parameters (`num_experts_per_tok` = 2):** ~1.14M
* **Vocabulary Size (`vocab_size`):** 1,024 (Custom SentencePiece BPE with `byte_fallback` enabled)
* **Hidden Size (`hidden_size`):** 128
* **Number of Hidden Layers (`num_hidden_layers`):** 3
* **Number of Attention Heads (`num_attention_heads` / `num_key_value_heads`):** 4 / 1 *(4:1 GQA layout)*
* **Individual Expert Internal Dimension (`intermediate_size`):** 352 *(SwiGLU structure)*
* **Max Position Embeddings (`max_position_embeddings`):** 2,048
* **RoPE Base Frequency (`rope_theta`):** 10,000.0
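
The 4 / 1 head layout listed above means that all four query heads read the same cached K/V head. The sketch below shows that index mapping in plain Python for a single decoding step. It assumes a head dimension of 32 (`hidden_size` 128 divided by 4 heads) and standard scaled dot-product attention. The function names are illustrative only.

```python
import math

N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 4, 1, 32  # HEAD_DIM is an assumption: 128 / 4

def kv_head_for(q_head, n_q=N_Q_HEADS, n_kv=N_KV_HEADS):
    """Index of the K/V head that a given query head reads from."""
    group_size = n_q // n_kv  # 4 query heads per K/V head in this model
    return q_head // group_size

def attend(q, k_cache, v_cache):
    """Single-query softmax attention for one head over all cached positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_cache]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi / z * v[d] for wi, v in zip(w, v_cache))
            for d in range(len(v_cache[0]))]

def gqa_step(q_heads, k_cache, v_cache):
    """q_heads: [n_q][head_dim]; k_cache and v_cache: [n_kv][positions][head_dim]."""
    return [attend(q, k_cache[kv_head_for(h)], v_cache[kv_head_for(h)])
            for h, q in enumerate(q_heads)]
```

An engine that instead materializes four copies of the K/V head should produce the same outputs as this shared-cache version. A mismatch between the two usually points to a stride or indexing problem rather than a numerical one.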

## 📜 License

* **License:** **MIT License**. You are free to copy, modify, distribute, and use these assets in commercial, personal, or educational settings, subject to the terms of the MIT license.