---
license: mit
tags:
- mixtral
- moe
- gguf
- safetensors
- transformers
- validation
- test-suite
- japanese
- scratch-trained
---
# TinyStories Mixtral 2M Top-2 MoE GQA Japanese Validation Suite (tinymoeja2m)

This repository provides an ultra-lightweight, Japanese-specialized Mixtral variant scaled down to **about 2.05M total parameters**, of which **about 1.14M are active per token**. It is trained on 320k stories from the TinyStories dataset, translated into Japanese via Gemma 4.

The model uses a **2,048-token context window (2k)** and the standard RoPE base frequency (`rope_theta`) of **10,000.0**, so it can serve as a clean, trick-free baseline for validating runtime implementations.

It is designed specifically for debugging custom inference engines that combine Grouped-Query Attention (GQA) with a Mixture-of-Experts (MoE) feed-forward layer.

---
## Comparison: `tinymoeja2m` vs Other Variants

To help track feature coverage across the verification suite, the table below compares the structural layout of each variant:

| Feature / Metric | `tiny1m` (Standard) | `tinygemma1m` (Gemma 2) | `tinymoe2m` (English 4k) | `tinymoeja2m` (This Repository) |
| :--- | :--- | :--- | :--- | :--- |
| **Language** | English | English | English | **Japanese** |
| **Base Architecture** | Llama 2 | Gemma 2 | Llama 2 (Mixtral) | **Llama 2 (Mixtral Format)** |
| **FFN Structure** | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts | **Mixture-of-Experts (MoE)** |
| **Attention Mechanism** | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) | **GQA (Grouped-Query)** |
| **Total / Selected Experts** | 1 / - | 1 / - | 4 Experts / Top-2 | **4 Experts / Top-2** |
| **GQA Head Ratio (Q:KV)** | 1:1 (MHA) | 4:1 (GQA) | 1:1 (MHA) | **4:1 (Query: 4, KV: 1)** |
| **Max Position Embeddings** | - | - | 4,096 | **2,048 (2k Context)** |
| **RoPE Base (`rope_theta`)** | - | - | 15,000.0 | **10,000.0** |
| **Total / Active Params** | ~1.2M / ~1.2M | ~1.0M / ~1.0M | ~1.95M / ~1.14M | **~2.05M / ~1.14M** |
| **Primary Debug Target** | Core matrix mult | Advanced graph | Scatter/Gather loops | **GQA Broadcast & Byte Fallback** |

---
## Repository Structure & File Descriptions

### 1. GGUF Formats (Root Directory `./`)

Binary files optimized for execution via `llama.cpp` or compatible lower-level inference engines. Upstream parsers automatically recognize this under the `mixed` (Mixtral) descriptor.

| Filename | Type | Target / Validation Focus |
| :--- | :--- | :--- |
| **`tinymoeja2m.F32.gguf`** | `F32` | **Baseline Test.** Eliminates quantization noise to isolate and verify raw probability mathematics. |
| **`tinymoeja2m.F16.gguf`**<br>**`tinymoeja2m.BF16.gguf`** | `F16`<br>`BF16` | **Half-Precision Test.** Evaluates 16-bit floating-point unpacking routines and parallelized accumulation layers. |
| **`tinymoeja2m.Q8_0.gguf`** | `Q8_0` | **Standard Quantization.** Verifies block-based uniform scaling across decentralized MoE structures. |
| **`tinymoeja2m.Q4_0.gguf`**<br>**`tinymoeja2m.Q4_1.gguf`** | `Q4_0`<br>`Q4_1` | **Classic 4-bit Quantization.** Tests basic linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| **`tinymoeja2m.Q2_K.gguf`** | `Q2_K` | **Standard K-Quant (2-bit).** Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| **`tinymoeja2m.Q3_K_M.gguf`** | `Q3_K_M` | **Standard K-Quant (3-bit).** Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| **`tinymoeja2m.Q4_K_M.gguf`** | `Q4_K_M` | **Standard K-Quant (4-bit).** Target for modern 4-bit super-block logic coupled with sparse MoE graphs. |
| **`tinymoeja2m.Q5_K_M.gguf`** | `Q5_K_M` | **Standard K-Quant (5-bit).** Validates high-fidelity mixed 5-bit precision layouts. |
| **`tinymoeja2m.Q6_K.gguf`** | `Q6_K` | **Standard K-Quant (6-bit).** Validates 6-bit high-fidelity super-block dequantization. |
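
As a reference point for the `Q8_0` row, the block-based uniform scaling can be sketched in plain Python. This is a minimal sketch, assuming the ggml `Q8_0` layout of 32-element blocks that each store one half-precision scale plus 32 signed 8-bit values; the helper names (`quantize_q8_0`, `dequantize_q8_0`, `to_fp16`) are illustrative and not part of any library.

```python
import math
import struct

QK8_0 = 32  # elements per block (assumed ggml Q8_0 block size)


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_half_away(x: float) -> int:
    """Round to nearest integer, ties away from zero (like C's roundf)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_q8_0(values):
    """Split values into 32-element blocks of (fp16 scale, int8 list)."""
    if len(values) % QK8_0 != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), QK8_0):
        chunk = values[start:start + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                 # one scale per block
        inv = 1.0 / d if d else 0.0      # all-zero block keeps scale 0
        qs = [round_half_away(v * inv) for v in chunk]
        blocks.append((to_fp16(d), qs))  # the scale is stored as fp16
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct floats as scale * q for every element of every block."""
    return [d * q for d, qs in blocks for q in qs]
```

Comparing an engine's dequantized `Q8_0` tensors against a reference of this kind, and then against the `F32` file, helps separate unpacking bugs from ordinary quantization error.
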
### 2. Hugging Face Native Format (`./hf/`)

Unquantized components formatted for direct loading with the PyTorch `transformers` library:

* **`hf/model.safetensors`**: Raw unquantized weights containing all 4 expert sub-networks, the GQA projection matrices, and the router tensor.
* **`hf/config.json`**: Architecture specification built around `MixtralConfig`. It sets `num_attention_heads: 4`, `num_key_value_heads: 1`, `max_position_embeddings: 2048`, and `rope_theta: 10000.0`.
* **`hf/generation_config.json`**: Standard generation defaults.
* **`hf/tokenizer.model`**: The custom SentencePiece BPE model (1,024-token vocabulary), trained on a clean Japanese text subset with `byte_fallback` enabled.
* **`hf/tokenizer.json`**: JSON-serialized token maps for fast interoperability across native tokenization backends.
* **`hf/tokenizer_config.json`**: Metadata that selects the `LlamaTokenizer` class to guarantee correct handling of prefix spacing and automatic `<s>` (BOS) injection.
* **`hf/special_tokens_map.json`**: Map of special tokens (`<s>`=1, `</s>`=2, `<unk>`=0, `<pad>`=2).

The weights are provided in four precision subfolders:

* **`./hf/`**: **Float32 (FP32) Master Subfolder.** The unquantized baseline precision weights. Recommended for initializing custom floating-point matrix operations without rounding loss.
* **`./hf.bf16/`**: **Bfloat16 (BF16) Subfolder.** Intended for hardware with native BF16 support (such as Ampere/Hopper Tensor Cores or Intel Arc/Gaudi) to examine native bfloat16 pipelines.
* **`./hf.fp16/`**: **Float16 (FP16) Subfolder.** Suited to standard 16-bit half-precision math routines and performance profiling.
* **`./hf.fp64/`**: **Float64 (FP64 / Double) Subfolder.** Stores the parameters in double precision, to help separate hardware-level execution bugs from accumulation errors.

---
## Purpose & Design Philosophy (Verification Targets)

This checkpoint is engineered as a deterministic validation asset for runtime computing backends and **is not designed for practical semantic tasks.**

Because of the compact parameter count (~2.05M) and the small vocabulary (1,024 tokens), the network spends its capacity on Japanese phrase continuation and basic syntax under an autoregressive objective.

### Critical Debugging Capabilities for Custom Engines:

1. **GQA Broadcast Matrix Multiplication**
   The 4:1 Grouped-Query Attention structure requires the execution kernels to share a single Key/Value cache block across 4 independent Query heads. This makes it a good testbed for tracking memory stride offsets and tensor broadcasting alignment in parallel compute shaders.
2. **Multi-Byte UTF-8 Byte Fallback Validation**
   With the vocabulary limited to 1,024 tokens, any kanji or other character outside the primary training subset triggers the `byte_fallback` mechanism, which breaks the character down into sequential raw UTF-8 byte tokens (typically 3 tokens per character, since most Japanese characters occupy 3 bytes in UTF-8). This stress-tests the engine's streaming decoder, which must stitch unaligned byte streams back into correct Japanese text without truncation or corruption.
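
The GQA broadcast in item 1 can be illustrated with a single decode step in plain Python. This is a minimal sketch rather than the kernel of any particular engine: the function names and the nested-list tensor layout are assumptions made for illustration. Each query head `h` reads the cache of KV head `h // group`, so with 4 query heads and 1 KV head every query indexes the same cache block and nothing is copied.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head attention of one query vector over all cached positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def gqa_attention(queries, k_cache, v_cache, n_kv_heads):
    """queries: [n_heads][head_dim]; caches: [n_kv_heads][seq_len][head_dim]."""
    n_heads = len(queries)
    group = n_heads // n_kv_heads  # 4 query heads per KV head for this model
    out = []
    for h, q in enumerate(queries):
        kv = h // group  # shared KV head index; the cache is indexed, not copied
        out.append(attend(q, k_cache[kv], v_cache[kv]))
    return out
```

A useful cross-check is that calling `gqa_attention` with one KV head gives the same result as ordinary multi-head attention over a cache that repeats that KV head four times; a mismatch in a real kernel usually points at a stride or broadcast bug.
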
---

## Usage Examples

### A. Running GGUF via llama.cpp

To run the GQA MoE execution graph and evaluate dynamic expert routing directly from your shell:

```bash
./llama-cli -m tinymoeja2m.Q4_K_M.gguf -p "トムとリリーは" -n 64 --temp 0.0
```
### B. Loading Hugging Face Formats via Python

```python
import torch
import sentencepiece as spm
from transformers import MixtralForCausalLM
from huggingface_hub import hf_hub_download

# Target repository
repo_id = "shibatch/tinymoeja2m"

print("Downloading and caching the tokenizer...")
# Fetch tokenizer.model from the hf/ subfolder on the Hugging Face Hub
tokenizer_file = hf_hub_download(repo_id=repo_id, subfolder="hf", filename="tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.Load(tokenizer_file)

print("Downloading and loading Mixtral-based 2M MoE model weights...")
model = MixtralForCausalLM.from_pretrained(repo_id, subfolder="hf")

# Prefer CUDA, then Intel XPU (torch.xpu exists only in recent PyTorch builds), then CPU
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
else:
    device = "cpu"
model = model.to(device)
model.eval()

# Prompt text drawn from the training vocabulary
prompt = "トムとリリーは"
input_ids = [1] + sp.EncodeAsIds(prompt)  # Explicitly prepend BOS (1)
input_tensor = torch.tensor([input_ids]).to(device)

print("Executing text generation loop (Validating 4:1 GQA & Top-2 MoE Kernels)...")
with torch.no_grad():
    # Greedy decoding (do_sample=False) keeps the output deterministic
    output_ids = model.generate(
        input_tensor,
        max_length=64,
        do_sample=False,
        pad_token_id=2,
        bos_token_id=1,
        eos_token_id=2,
    )

generated_ids = output_ids[0].cpu().tolist()
generated_text = sp.DecodeIds(generated_ids)

print("\n--- Inference Test Result ---")
print("Prompt :", prompt)
print("Generated:", generated_text)
```
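
The example above decodes the whole sequence at once with `sp.DecodeIds`, which hides the byte-fallback problem. An engine that streams text token by token has to buffer byte tokens until a full UTF-8 sequence is available. The following is a minimal sketch of such a buffer, assuming byte-fallback pieces use the SentencePiece `<0xNN>` naming and that `▁` (U+2581) marks a space; the class name `StreamingDetokenizer` is hypothetical, and stripping of the leading prefix space is deliberately left out.

```python
import codecs
import re

# SentencePiece byte-fallback pieces look like "<0xE3>"
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


class StreamingDetokenizer:
    """Turn a stream of token pieces into text, one piece at a time."""

    def __init__(self):
        # The incremental decoder holds back incomplete multi-byte sequences
        self._utf8 = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def push(self, piece: str) -> str:
        """Return the text that becomes printable after this piece."""
        match = BYTE_PIECE.match(piece)
        if match:
            byte = bytes([int(match.group(1), 16)])
            return self._utf8.decode(byte)  # "" until the character is complete
        # A regular piece ends any pending byte run; flush it as U+FFFD if broken
        pending = self._utf8.decode(b"", final=True)
        self._utf8.reset()
        return pending + piece.replace("\u2581", " ")

    def finish(self) -> str:
        """Flush whatever is still buffered at end of generation."""
        tail = self._utf8.decode(b"", final=True)
        self._utf8.reset()
        return tail
```

Feeding the pieces `<0xE6>`, `<0x97>`, `<0xA5>` one by one yields two empty strings followed by `日`, which is the behaviour a streaming front end needs in order to avoid printing partial characters.
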
---

## Model Specifications

* **Architecture:** Mixtral (`MixtralForCausalLM`)
* **Dataset:** TinyStories Japanese Translation Corpus (320k stories)
* **Total Parameters (`num_local_experts` = 4):** ~2.05M
* **Active Parameters (`num_experts_per_tok` = 2):** ~1.14M
* **Vocabulary Size (`vocab_size`):** 1,024 (Custom SentencePiece BPE with `byte_fallback` enabled)
* **Hidden Size (`hidden_size`):** 128
* **Number of Hidden Layers (`num_hidden_layers`):** 3
* **Number of Attention Heads (`num_attention_heads` / `num_key_value_heads`):** 4 / 1 *(4:1 GQA layout)*
* **Individual Expert Internal Dimension (`intermediate_size`):** 352 *(SwiGLU structure)*
* **Max Position Embeddings (`max_position_embeddings`):** 2,048
* **RoPE Base Frequency (`rope_theta`):** 10,000.0
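
To make the `num_local_experts` = 4 / `num_experts_per_tok` = 2 setting concrete, the feed-forward step for one token can be sketched in plain Python. This is a minimal sketch under stated assumptions: it follows the usual Mixtral formulation (router logits, top-2 selection, softmax over the selected logits, SwiGLU experts), the function names are illustrative, and weights are nested lists rather than real tensors. Ties between router logits are broken by lower expert index here, which an actual engine may do differently.

```python
import math


def matvec(w, x):
    """Multiply a [rows][cols] matrix by a vector of length cols."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_expert(x, w1, w3, w2):
    """One expert: w2 @ (silu(w1 @ x) * (w3 @ x))."""
    gate = [silu(g) for g in matvec(w1, x)]
    up = matvec(w3, x)
    return matvec(w2, [g * u for g, u in zip(gate, up)])


def top2_moe(x, router, experts, top_k=2):
    """Route one token through its top_k experts and mix their outputs."""
    logits = matvec(router, x)  # one logit per expert
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    m = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - m) for i in chosen]
    weights = [e / sum(exps) for e in exps]  # renormalized over chosen experts
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = swiglu_expert(x, *experts[i])  # only the selected experts run
        out = [o + w * v for o, v in zip(out, y)]
    return out, chosen, weights
```

With the dimensions listed above, `x` would have 128 elements, `router` would be 4 x 128, and each expert would hold `w1` and `w3` of shape 352 x 128 plus `w2` of shape 128 x 352.
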

## License

* **License:** **MIT License**. You may copy, modify, distribute, and use these assets in commercial, personal, or educational settings under the terms of the MIT License.