---
license: mit
tags:
- mixtral
- moe
- gguf
- safetensors
- transformers
- validation
- test-suite
- japanese
- scratch-trained
---

# TinyStories Mixtral 2M Top-2 MoE GQA Japanese Validation Suite (tinymoeja2m)

This repository provides an ultra-lightweight, Japanese-specialized Mixtral variant with **~2.05M total parameters**, of which **~1.14M are active per token**. It is trained on the 320k Japanese stories translated from the TinyStories dataset via Gemma 4.

The model is configured with a **2,048-token context window (2k)** and the standard RoPE base frequency (`rope_theta`) of **10,000.0**, so it can act as a clean, trick-free baseline for validating runtime implementations. It is designed specifically for debugging custom inference engines that combine Grouped-Query Attention (GQA) with a Mixture-of-Experts (MoE) topology.

---

## 📊 Comparison: `tinymoeja2m` vs Other Variants

To help track feature coverage across the verification suite, the structural layouts of the variants are outlined below:

| Feature / Metric | `tiny1m` (Standard) | `tinygemma1m` (Gemma 2) | `tinymoe2m` (English 4k) | `tinymoeja2m` (This Repository) |
| :--- | :--- | :--- | :--- | :--- |
| **Language** | English | English | English | **Japanese** |
| **Base Architecture** | Llama 2 | Gemma 2 | Llama 2 (Mixtral) | **Llama 2 (Mixtral Format)** |
| **FFN Structure** | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts | **Mixture-of-Experts (MoE)** |
| **Attention Mechanism** | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) | **GQA (Grouped-Query)** |
| **Total / Selected Experts** | 1 / - | 1 / - | 4 Experts / Top-2 | **4 Experts / Top-2** |
| **GQA Head Ratio (Q:KV)** | 1:1 (MHA) | 4:1 (GQA) | 1:1 (MHA) | **4:1 (Query: 4, KV: 1)** |
| **Max Position Embeddings** | - | - | 4,096 | **2,048 (2k Context)** |
| **RoPE Base (`rope_theta`)** | - | - | 15,000.0 | **10,000.0** |
| **Total / Active Params** | ~1.2M / ~1.2M | ~1.0M / ~1.0M | ~1.95M / ~1.14M | **~2.05M / ~1.14M** |
| **Primary Debug Target** | Core matrix mult | Advanced graph | Scatter/Gather loops | **GQA Broadcast & Byte Fallback** |

---

## 📂 Repository Structure & File Descriptions

### 1. GGUF Formats (Root Directory `./`)

Binary files optimized for execution via `llama.cpp` or compatible lower-level inference engines. Upstream parsers automatically recognize this under the `mixed` (Mixtral) descriptor.

| Filename | Type | Target / Validation Focus |
| :--- | :--- | :--- |
| **`tinymoeja2m.F32.gguf`** | `F32` | **Baseline Test.** Eliminates quantization noise to isolate and verify raw probability mathematics. |
| **`tinymoeja2m.F16.gguf`**<br>**`tinymoeja2m.BF16.gguf`** | `F16`<br>`BF16` | **Half-Precision Test.** Evaluates 16-bit floating-point unpacking routines and parallelized accumulation layers. |
| **`tinymoeja2m.Q8_0.gguf`** | `Q8_0` | **Standard Quantization.** Verifies block-based uniform scaling across decentralized MoE structures. |
| **`tinymoeja2m.Q4_0.gguf`**<br>**`tinymoeja2m.Q4_1.gguf`** | `Q4_0`<br>`Q4_1` | **Classic 4-bit Quantization.** Tests basic linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| **`tinymoeja2m.Q2_K.gguf`** | `Q2_K` | **Standard K-Quant (2-bit).** Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| **`tinymoeja2m.Q3_K_M.gguf`** | `Q3_K_M` | **Standard K-Quant (3-bit).** Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| **`tinymoeja2m.Q4_K_M.gguf`** | `Q4_K_M` | **Standard K-Quant (4-bit).** Target for modern 4-bit super-block logic coupled with sparse MoE graphs. |
| **`tinymoeja2m.Q5_K_M.gguf`** | `Q5_K_M` | **Standard K-Quant (5-bit).** Validates high-fidelity mixed 5-bit precision layouts. |
| **`tinymoeja2m.Q6_K.gguf`** | `Q6_K` | **Standard K-Quant (6-bit).** Validates 6-bit high-fidelity super-block dequantization. |

### 2. Hugging Face Native Format (`./hf/`)

Unquantized components formatted for direct instantiation with the PyTorch `transformers` library:

* **`hf/model.safetensors`**: Raw unquantized matrix parameters containing all 4 expert sub-networks, GQA projection matrices, and the master router tensor.
* **`hf/config.json`**: Architectural specifications built around `MixtralConfig`. Fully configured to enforce `num_attention_heads: 4`, `num_key_value_heads: 1`, `max_position_embeddings: 2048`, and `rope_theta: 10000.0`.
* **`hf/generation_config.json`**: Standard generation defaults.
* **`hf/tokenizer.model`**: The custom SentencePiece BPE master binary with a 1,024-token vocabulary, trained on a clean Japanese text subset with `byte_fallback` enabled.
* **`hf/tokenizer.json`**: JSON-serialized token maps for high-speed interoperability across native tokenization backends.
* **`hf/tokenizer_config.json`**: Metadata linking the `LlamaTokenizer` classes to guarantee correct handling of prefix spacing and automatic BOS token injection.
* **`hf/special_tokens_map.json`**: Structural map linking special tokens to their IDs (BOS=1, EOS=2, UNK=0, PAD=2).

The weights are provided in four precision variants, one per subfolder:

* **`./hf/`** : **Float32 (FP32) Master Subfolder.** The unquantized baseline precision weights. Highly recommended for initializing custom floating-point matrix operations without rounding loss.
* **`./hf.bf16/`** : **Bfloat16 (BF16) Subfolder.** Optimized for modern hardware accelerators (such as Ampere/Hopper Tensor Cores or Intel Arc/Gaudi) to examine native 16-bit brain floating-point pipelines.
* **`./hf.fp16/`** : **Float16 (FP16) Subfolder.** Ideal for standard 16-bit half-precision parallel math routines and performance profiling.
* **`./hf.fp64/`** : **Float64 (FP64 / Double) Subfolder.** Retains double-precision parameters. Designed to strictly isolate hardware-level execution bugs from accumulation errors.

---

## 🎯 Purpose & Design Philosophy (Verification Targets)

This checkpoint is specifically engineered as a deterministic validation test asset for runtime computing backends and **is not designed for practical semantic tasks.** Due to the compact parameter size (~2.05M) and the narrow vocabulary (1,024 tokens), the network concentrates its capacity entirely on Japanese phrase continuation and basic syntax under an autoregressive framework.

### Critical Debugging Capabilities for Custom Engines:

1. **GQA Broadcast Matrix Multiplication**

   The 4:1 Grouped-Query Attention structure requires the execution kernels to correctly share a single Key/Value cache block across 4 independent Query heads. This serves as an ideal testbed for tracking memory stride offsets and tensor broadcasting alignment in parallel computing shaders.

2. **Multi-Byte UTF-8 Byte Fallback Validation**

   With the vocabulary limited to 1,024 tokens, any kanji or other character outside the primary training subset triggers the `byte_fallback` mechanism, which breaks the character down into raw sequential UTF-8 byte tokens (typically 3 tokens per character, since most Japanese characters occupy 3 bytes in UTF-8). This is a rigorous stress test of the engine's streaming decoder, which must stitch unaligned byte streams back into correct Japanese text without truncation or corruption.

---

## 🚀 Usage Examples

### A. Running GGUF via llama.cpp

To process the GQA MoE execution graph and evaluate dynamic expert routing directly from your shell:

```bash
./llama-cli -m tinymoeja2m.Q4_K_M.gguf -p "トムとリリーは" -n 64 --temp 0.0
```

### B. Loading Hugging Face Formats via Python

```python
import torch
import sentencepiece as spm
from transformers import MixtralForCausalLM
from huggingface_hub import hf_hub_download

# Define target repository identity
repo_id = "shibatch/tinymoeja2m"

print("Downloading and caching specialized tokenizer layer...")
# Fetch tokenizer.model file automatically from Hugging Face Hub
tokenizer_file = hf_hub_download(repo_id=repo_id, subfolder="hf", filename="tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.Load(tokenizer_file)

print("Downloading and loading Mixtral-based 2M MoE model weights...")
model = MixtralForCausalLM.from_pretrained(repo_id, subfolder="hf")

# Prefer CUDA, then Intel XPU (torch.xpu only exists in recent PyTorch builds), then CPU
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
else:
    device = "cpu"
model = model.to(device)
model.eval()

# Prompt text utilizing vocabulary subsets
prompt = "トムとリリーは"
input_ids = [1] + sp.EncodeAsIds(prompt)  # Explicitly prepend BOS (1)
input_tensor = torch.tensor([input_ids]).to(device)

print("Executing text generation loop (Validating 4:1 GQA & Top-2 MoE Kernels)...")
with torch.no_grad():
    output_ids = model.generate(
        input_tensor,
        max_length=64,
        do_sample=False,  # Greedy decoding keeps the output deterministic
        pad_token_id=2,
        bos_token_id=1,
        eos_token_id=2,
    )

generated_ids = output_ids[0].cpu().tolist()
generated_text = sp.DecodeIds(generated_ids)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)
```

---

## Model Specifications

* **Architecture:** Mixtral (`MixtralForCausalLM`)
* **Dataset:** TinyStories Japanese Translation Corpus (320k stories)
* **Total Parameters (`num_local_experts` = 4):** ~2.05M
* **Active Parameters (`num_experts_per_tok` = 2):** ~1.14M
* **Vocabulary Size (`vocab_size`):** 1,024 (Custom SentencePiece BPE with `byte_fallback` enabled)
* **Hidden Size (`hidden_size`):** 128
* **Number of Hidden Layers (`num_hidden_layers`):** 3
* **Number of Attention Heads (`num_attention_heads` / `num_key_value_heads`):** 4 / 1 *(4:1 GQA layout)*
* **Individual Expert Internal Dimension (`intermediate_size`):** 352 *(SwiGLU structure)*
* **Max Position Embeddings (`max_position_embeddings`):** 2,048
* **RoPE Base Frequency (`rope_theta`):** 10,000.0

## 📜 License

* **License:** **MIT License**. You are free to copy, modify, distribute, and use these assets in commercial, personal, or educational settings, subject to the terms of the MIT License.
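
## 🧪 Sketch: Streaming Byte-Fallback Decoding

The byte-fallback debugging target described under "Critical Debugging Capabilities" can be illustrated with a small reference decoder. The following is a minimal sketch, not the implementation used by this repository or by `llama.cpp`. It assumes that the engine hands over token pieces as strings and that byte-fallback pieces follow the SentencePiece `<0xNN>` naming convention. The class `StreamingDetokenizer` and its `push` / `finish` methods are hypothetical names introduced only for this example, and the handling of the `▁` whitespace marker is deliberately simplified.

```python
import codecs
import re

# Assumption: byte-fallback pieces look like "<0xE3>" (SentencePiece convention).
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


class StreamingDetokenizer:
    """Hypothetical helper: converts a stream of token pieces into text,
    holding back bytes until they form a complete UTF-8 character."""

    def __init__(self):
        # The incremental decoder buffers incomplete multi-byte sequences.
        # errors="replace" turns a broken sequence into U+FFFD instead of raising.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def push(self, piece: str) -> str:
        """Feed one token piece and return the text that is safe to print now."""
        match = BYTE_PIECE.match(piece)
        if match:
            # Byte-fallback token: feed a single raw byte. The result is an
            # empty string until the final byte of the character arrives.
            return self._decoder.decode(bytes([int(match.group(1), 16)]))
        # Ordinary piece: flush any dangling bytes first (they can no longer
        # be completed), then emit the piece with the SentencePiece
        # whitespace marker mapped back to a plain space (simplified).
        pending = self._decoder.decode(b"", final=True)
        self._decoder.reset()
        return pending + piece.replace("\u2581", " ")

    def finish(self) -> str:
        """Flush at end of generation; a truncated character becomes U+FFFD."""
        return self._decoder.decode(b"", final=True)


if __name__ == "__main__":
    detok = StreamingDetokenizer()
    # Simulate a kanji that is outside the vocabulary and arrives as raw bytes.
    fallback = [f"<0x{b:02X}>" for b in "猫".encode("utf-8")]
    for piece in ["トム", *fallback, "は"]:
        print(repr(piece), "->", repr(detok.push(piece)))
    print("tail ->", repr(detok.finish()))
```

The design point is that the decoder never emits a partial character: the first bytes of a fallback sequence produce empty output, and the character appears only when its last byte arrives. Comparing an engine's streamed output against such a reference is one way to catch truncation or corruption at token boundaries.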