license: mit
base_model: mistralai/Mixtral-8x7B-v0.1
tags:
  - mixtral
  - moe
  - gguf
  - safetensors
  - transformers
  - validation
  - test-suite

TinyStories Mixtral 2M Top-2 MoE (tinymoe2m) GGUF & HF Validation Suite (4k Context)

This repository provides an ultra-lightweight Mixtral variant (a Mixture-of-Experts architecture built on the Llama 2 layer layout) scaled down to about 1.95M total parameters, of which about 1.14M are active per token. It is trained on the TinyStories dataset and intended as a precise validation asset.

The model is configured for a 4,096-token context window (4k) with an adjusted RoPE base frequency (rope_theta) of 15,000.0, intended to keep positional resolution sharp over short distances.
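To make the effect of rope_theta concrete, the sketch below computes the per-pair rotation frequencies using the standard RoPE formula, theta ** (-2i / head_dim). The head dimension of 64 is an assumption derived from the specifications in this README (hidden size 128 divided by 2 attention heads); the helper name `rope_inv_freq` is illustrative and not part of any library.

```python
import math

def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Rotation frequency of each dimension pair: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

# Assumed: head_dim = hidden_size / number of attention heads = 128 / 2
HEAD_DIM = 128 // 2
ROPE_THETA = 15000.0
MAX_POS = 4096

inv_freq = rope_inv_freq(HEAD_DIM, ROPE_THETA)
# Number of token positions each pair needs for one full rotation
wavelengths = [2 * math.pi / f for f in inv_freq]

print(f"dimension pairs     : {len(inv_freq)}")
print(f"shortest wavelength : {wavelengths[0]:.2f} tokens")
print(f"longest wavelength  : {wavelengths[-1]:.0f} tokens")
print(f"longest > context   : {wavelengths[-1] > MAX_POS}")
```

An engine under test can compare its own frequency table against these values before looking at attention outputs, which narrows a logit mismatch down to the RoPE stage or rules it out.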

It is designed specifically for debugging custom inference engines and native tensor compilers against MoE-specific runtime features. These include gating-network weight computation, token distribution and gathering (scatter/gather loops), and the weighted sum that combines the outputs of multiple independent experts.
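The routing steps named above can be sketched for a single token as follows. This is a minimal pure-Python illustration, assuming the commonly used Mixtral-style scheme (softmax over all router logits, keep the top 2, renormalize the kept weights); the function names and the toy experts are hypothetical stand-ins, not code from this repository.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Gating: softmax over all experts, keep the top-k, renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    indices, weights = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for i, w in zip(indices, weights):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Hypothetical toy experts standing in for the SwiGLU FFNs: each scales its input
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
logits = [0.1, 2.0, -1.0, 1.0]

indices, weights = top_k_route(logits)
print("selected experts:", indices)
print("routing weights :", weights)
print("combined output :", moe_forward(token, logits, experts))
```

A batched engine typically groups tokens by selected expert (scatter), runs each expert once on its group, and writes the weighted results back to the original token positions (gather); the per-token result should match this reference.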


📊 Comparison: tinymoe2m vs Other 1M Variants

To help track feature coverage across the 1M/2M verification suite, the core structural layouts are outlined below:

| Feature / Metric | tiny1m (Standard) | tinybpe1m (BPE Variant) | tinygemma1m (Gemma 2 Variant) | tinymoe2m (This Repository) |
|---|---|---|---|---|
| Base Architecture | Llama 2 | Llama 2 | Gemma 2 | Llama 2 (Mixtral Format) |
| FFN Structure | Single FFN (Dense) | Single FFN (Dense) | Single FFN (Dense) | Mixture-of-Experts (MoE) |
| Attention Mechanism | MHA (Multi-Head) | MHA (Multi-Head) | GQA (Grouped-Query) | MHA (Multi-Head) |
| Total Experts | 1 (Non-MoE) | 1 (Non-MoE) | 1 (Non-MoE) | 4 Experts |
| Selected Experts | - | - | - | Top-2 Experts |
| Expert FFN Dim (intermediate_size) | 564 | 352 | 352 | 352 (Shared across all experts) |
| Max Position Embeddings | - | - | - | 4,096 |
| RoPE Base (rope_theta) | - | - | - | 15,000.0 |
| Total Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.95M (1.95M Total) |
| Active Parameters | ~1.2M | ~1.0M | ~1.0M | ~1.14M (1.14M Active) |
| Primary Debug Target | Core matrix mult & layout | byte_fallback decode | Gemma 2 advanced graph | Dynamic Routing & Scatter/Gather |

💡 Compute Cost vs Capacity Optimization

With approximately 1.95M total parameters, this model has roughly twice the capacity of the standard 1M dense variants, which helps it keep a stable command of the grammar and phrasing found in the TinyStories corpus. Because only the top-2 experts run for each token, the active parameter count is capped at ~1.14M. This reproduces the core benefit of MoE architectures at small scale: total capacity grows by about 2x, while the floating-point operation (FLOP) cost rises by only about 1.1x to 1.2x relative to a 1M dense counterpart.
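The total and active figures can be reproduced from the configuration values listed under Model Specifications. The sketch below assumes untied input and output embeddings, bias-free projections, one RMSNorm weight vector per norm, and that the router counts as active; these assumptions are not stated in this README, but they yield numbers matching the ~1.95M and ~1.14M quoted here. The function name is illustrative.

```python
def mixtral_param_count(vocab, hidden, layers, inter, n_experts, top_k, tied=False):
    """Return (total, active) parameter counts for a Mixtral-style model."""
    embed = vocab * hidden
    lm_head = 0 if tied else vocab * hidden  # assumed untied
    attn = 4 * hidden * hidden               # q, k, v, o projections (MHA, no biases)
    expert = 3 * hidden * inter              # gate, up, down projections (SwiGLU)
    router = hidden * n_experts              # gating matrix
    norms = 2 * hidden                       # two norm weight vectors per layer

    def count(experts_used):
        per_layer = attn + experts_used * expert + router + norms
        return embed + lm_head + layers * per_layer + hidden  # + final norm

    return count(n_experts), count(top_k)

total, active = mixtral_param_count(
    vocab=512, hidden=128, layers=3, inter=352, n_experts=4, top_k=2
)
print(f"total  parameters: {total:,}")   # 1,952,128
print(f"active parameters: {active:,}")  # 1,141,120
print(f"active / total   : {active / total:.2f}")
```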


📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory ./)

Binary files for execution with llama.cpp or compatible low-level inference engines. Upstream parsers should recognize the Mixtral-style MoE layout automatically from the GGUF metadata.

| Filename | Type | Size | Target / Validation Focus |
|---|---|---|---|
| tinymoe2m.F32.gguf | F32 | ~8.0 MB | Baseline Test. Eliminates quantization noise to isolate and verify the raw probability mathematics of the Gating network and expert tensor synthesis. |
| tinymoe2m.F16.gguf, tinymoe2m.BF16.gguf | F16, BF16 | ~4.0 MB | Half-Precision Test. Evaluates 16-bit floating-point unpacking routines and stability under parallelized accumulation layers. |
| tinymoe2m.Q8_0.gguf | Q8_0 | ~2.2 MB | Standard Quantization. Verifies block-based uniform scaling (32-element blocks) across decentralized MoE structures. |
| tinymoe2m.Q4_0.gguf, tinymoe2m.Q4_1.gguf | Q4_0, Q4_1 | ~1.4 MB | Classic Quantization. Tests 4-bit linear scaling and unpacking logic across multiple discontinuous expert weight matrices. |
| tinymoe2m.Q2_K.gguf | Q2_K | ~1.1 MB | Standard K-Quant (2-bit). Evaluates mixed super-block dequantization loops feeding sparse FFN routines. |
| tinymoe2m.Q3_K_M.gguf | Q3_K_M | ~1.2 MB | Standard K-Quant (3-bit). Tests sub-variant multi-block layouts handling dynamic routing vectors. |
| tinymoe2m.Q4_K_M.gguf | Q4_K_M | ~1.4 MB | Standard K-Quant (4-bit). The baseline testing target for modern 4-bit super-block logic coupled with MoE paths. |
| tinymoe2m.Q5_K_M.gguf | Q5_K_M | ~1.5 MB | Standard K-Quant (5-bit). Validates high-fidelity mixed 5-bit precision layouts. |
| tinymoe2m.Q6_K.gguf | Q6_K | ~1.7 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block dequantization. |
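As an illustration of the block-based uniform scaling that the Q8_0 row refers to, here is a simplified pure-Python sketch: each 32-element block gets one scale (largest magnitude divided by 127) and 32 signed 8-bit values. It is an approximation for reasoning about expected error, not the on-disk format: the scale is kept as a Python float rather than a 16-bit float, and Python's `round` may differ from a C implementation on exact ties. The function names are hypothetical.

```python
BLOCK = 32  # elements per quantization block

def quantize_q8_0(values):
    """Quantize a flat list of floats into (scale, int8 list) blocks."""
    assert len(values) % BLOCK == 0, "length must be a multiple of the block size"
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0
        if scale == 0.0:
            q = [0] * BLOCK  # all-zero block
        else:
            q = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, q))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct floats: each value is scale * quantized integer."""
    return [scale * qi for scale, q in blocks for qi in q]

# Deterministic toy weights spanning two blocks
weights = [((i * 37) % 64 - 32) / 100.0 for i in range(64)]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks: {len(blocks)}, max abs error: {max_err:.6f}")
```

The reconstruction error per element is bounded by half the block's scale, which gives a tolerance to use when comparing a quantized backend against the F32 baseline.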

2. Hugging Face Native Format (./hf/)

Unquantized components that can be loaded directly with the Hugging Face transformers library (PyTorch):

  • hf/model.safetensors: Unquantized weights containing all 4 expert sub-networks together with the router tensors.
  • hf/config.json: Architecture settings following MixtralConfig (layer depth, head counts, number of experts, and top-k selection). Sets max_position_embeddings: 4096 and rope_theta: 15000.0.
  • hf/generation_config.json: Standard generation defaults.
  • hf/tokenizer.model: The custom SentencePiece BPE model with a 512-token vocabulary.
  • hf/tokenizer_config.json: Tokenizer metadata that selects the LlamaTokenizer classes so that prefix spacing and automatic <s> (BOS) insertion are handled correctly on the Hugging Face backend. Configured with model_max_length: 4096.
  • hf/special_tokens_map.json: Maps the special token strings (<s>=1, </s>=2) to their token IDs.
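Before running an engine against these files, it can help to confirm that the configuration it parsed matches the values documented in this README. The sketch below compares a config dictionary against those values; the key names are the ones quoted in this README, while the helper names and the assumption of a local copy at hf/config.json are illustrative.

```python
import json

# Values as documented in this README
EXPECTED = {
    "max_position_embeddings": 4096,
    "rope_theta": 15000.0,
    "num_local_experts": 4,
    "num_experts_per_tok": 2,
    "hidden_size": 128,
    "num_hidden_layers": 3,
    "intermediate_size": 352,
    "vocab_size": 512,
}

def config_mismatches(config: dict) -> dict:
    """Return {key: (found, expected)} for every key that differs or is missing."""
    return {
        key: (config.get(key), want)
        for key, want in EXPECTED.items()
        if config.get(key) != want
    }

def check_config_file(path: str = "hf/config.json") -> dict:
    """Load a local config.json (path is an assumption) and report mismatches."""
    with open(path, encoding="utf-8") as f:
        return config_mismatches(json.load(f))

# Example with an in-memory config that carries a wrong RoPE base
sample = dict(EXPECTED, rope_theta=10000.0)
print(config_mismatches(sample))  # {'rope_theta': (10000.0, 15000.0)}
```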

🎯 Purpose & Design Philosophy (Verification Targets)

This checkpoint is specifically engineered as a deterministic validation test asset for computing platforms and is not designed for long-context semantic extraction tasks (such as Needle-in-a-Haystack password retrieval).

Because of its very small capacity (~1.95M total parameters) and compact 512-token vocabulary, the network spends its expressiveness almost entirely on English syntax and high-frequency phrases. It lacks the multi-layer induction (copy) circuits needed to retrieve an out-of-context string or a single inserted character across a large window.

Expected Token Output Behavior

When processed with template phrases containing temporary password identifiers like: "The magic password of the giant was key X. I remember that the magic password of the giant was"

The network will not copy the literal character X. Instead, it continues with frequent learned phrases such as "about to go home. Every day...". This is expected behavior for a model of this size. Validation therefore relies strictly on bit-exact logit comparison across runtime backends, which confirms matching compute kernels, KV cache indexing, causal attention masking, and RoPE phase calculation.
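One way to carry out such a comparison is sketched below: it treats two logit vectors for the same position as float32 values and compares their bit patterns, while also reporting the largest absolute difference and whether the argmax agrees. The function names and report layout are hypothetical, and the sketch assumes both backends emit float32 logits that have already been loaded into Python lists.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to IEEE 754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def compare_logits(ref, test):
    """Compare two non-empty, equal-length logit vectors from different backends."""
    assert len(ref) == len(test) and ref, "vectors must be non-empty and equal length"
    mismatched = [
        i for i, (a, b) in enumerate(zip(ref, test)) if f32_bits(a) != f32_bits(b)
    ]
    return {
        "bit_exact": not mismatched,
        "mismatched_indices": mismatched,
        "max_abs_diff": max(abs(a - b) for a, b in zip(ref, test)),
        "same_argmax": (
            max(range(len(ref)), key=ref.__getitem__)
            == max(range(len(test)), key=test.__getitem__)
        ),
    }

reference = [0.25, -1.5, 3.0]
candidate = [0.25, -1.5, 3.000001]  # differs by a few float32 ulps in the last entry

print(compare_logits(reference, reference))
print(compare_logits(reference, candidate))
```

When bit-exactness fails but the argmax matches and the difference is tiny, the cause is often accumulation order or a fused kernel rather than a logic error, so the report keeps both signals.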


🚀 Usage Examples

A. Running GGUF via llama.cpp

To run the MoE execution graph and exercise dynamic expert routing from your shell:

./llama-cli -m tinymoe2m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0

B. Loading Hugging Face Formats via Python

Because the configuration matches the custom vocabulary, the model and tokenizer load with the standard Auto classes and need no custom wrapper code.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tinymoe2m"

print("Loading MoE configuration and tokenizer layers...")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Tom and Jerry are "
inputs = tokenizer(prompt, return_tensors="pt").to(device)

print("Running inference loop (Validating Top-2 sparse routing matrices)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_length=64, 
        do_sample=False
    )
    
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Inference Test Result ---")
print("Prompt   :", prompt)
print("Generated:", generated_text)

Model Specifications

  • Architecture: Mixtral (MixtralForCausalLM)
  • Dataset: TinyStories
  • Total Parameters (num_local_experts = 4): ~1.95M
  • Active Parameters (num_experts_per_tok = 2): ~1.14M
  • Vocabulary Size (vocab_size): 512 (Custom SentencePiece BPE with byte_fallback enabled)
  • Hidden Size (hidden_size): 128
  • Number of Hidden Layers (num_hidden_layers): 3
  • Number of Attention Heads (num_attention_heads / num_key_value_heads): 2 / 2 (MHA layout)
  • Individual Expert Internal Dimension (intermediate_size): 352 (SwiGLU structure)
  • Max Position Embeddings (max_position_embeddings): 4,096
  • RoPE Base Frequency (rope_theta): 15,000.0

📜 License

  • License: MIT License. You are free to copy, modify, distribute, and use these assets in commercial, personal, or educational settings, subject to the terms of the MIT License.