MicroMixer-1-500K-TinyStories

Micro Language Model
Attention-Free • MLP-Only • Byte-Level

GitHub


📋 Overview

MicroMixer-1-500K is a mid-sized model with about 557K parameters (557,328 in total). At this scale, basic sentence structures such as "Once upon a time there was a little boy named Sammy" begin to emerge.


🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[ImprovedMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

| Parameter | Value |
|---|---|
| Total Parameters | 557,328 |
| Hidden Dimension | 176 |
| Channel MLP Dimension | 384 |
| Number of Layers | 3 |
| Max Sequence Length | 128 |
| Vocabulary Size | 256 (Byte-level) |
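
A byte-level vocabulary means each UTF-8 byte is its own token, so ids fall in the range 0 to 255 and no learned vocabulary file is needed. The following is a minimal sketch, assuming plain UTF-8 encoding with no special tokens. The helper names `encode_bytes` and `decode_bytes` are hypothetical stand-ins for illustration; the repository's `ByteTokenizer` may behave differently.

```python
MAX_SEQ_LEN = 128  # max sequence length from the configuration table


def encode_bytes(text: str) -> list[int]:
    """Map text to token ids, one id (0-255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Map token ids back to text; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode_bytes("Once upon a time")
print(len(ids), ids[:4])   # 16 bytes, starting with the codes for "Once"
print(decode_bytes(ids))
print(len(ids) <= MAX_SEQ_LEN)
```

Non-ASCII characters take more than one byte, so a 128-token context holds fewer than 128 characters of such text.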

Core Components

┌─────────────────────────────────────────────┐
│           ImprovedMixerLayer                │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

1️⃣ RoPE (Rotary Position Embedding)

  • Encodes positions via rotation transformations
  • Enables length extrapolation beyond training sequences
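
To make "rotation transformations" concrete, here is a minimal sketch of rotary position encoding on a single feature vector. It assumes the common formulation in which consecutive feature pairs are rotated by an angle proportional to the position, with a frequency base of 10000. The pairing scheme, the base, and the function name `rope` are assumptions for illustration, not details taken from this model's code.

```python
import math


def rope(x: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.3, -1.0, 0.7]

# Both pairs of positions are 2 apart, so the dot products should agree.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 12), rope(k, 10))
print(math.isclose(near, far, abs_tol=1e-9))
```

The check at the end illustrates the property usually cited for RoPE: dot products between rotated vectors depend only on the position offset, not on absolute position.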

2️⃣ HyperMixing (Token Mixing)

  • Compresses past context via cumulative average pooling
  • Hypernetwork generates adaptive weights
  • Mixes tokens in O(S) time for sequence length S, without attention
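
The pooling step can be sketched as a causal running mean, which is what keeps the cost linear in sequence length. This sketch covers only the cumulative average; the hypernetwork that generates adaptive weights is not shown, because this card does not specify it. The function name `causal_cumulative_mean` and the choice to include the current token in its own average are assumptions.

```python
def causal_cumulative_mean(xs: list[list[float]]) -> list[list[float]]:
    """For each position t, return the mean of xs[0..t] using a running sum."""
    if not xs:
        return []
    running = [0.0] * len(xs[0])
    out = []
    for t, x in enumerate(xs, start=1):
        # One update per token: total work is O(S) per feature dimension.
        running = [r + v for r, v in zip(running, x)]
        out.append([r / t for r in running])
    return out


seq = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(causal_cumulative_mean(seq))
# → [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
```

Each output position sees only itself and earlier positions, so the summary stays causal and can be used for next-byte prediction.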

3️⃣ MlpBlock (Channel Mixing)

  • Non-linear transformation of feature dimensions
  • Structure: Linear → GELU → Linear
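
The Linear → GELU → Linear structure can be sketched in plain Python as below. It assumes the exact (erf-based) GELU and linear layers with bias terms; the actual `MlpBlock` may use a different GELU approximation or omit biases. All function names here are hypothetical.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_block(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear applied to one token's feature vector."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def mlp_param_count(hidden_dim: int, mlp_dim: int) -> int:
    """Weights plus biases of both linear layers (biases are an assumption)."""
    return (hidden_dim * mlp_dim + mlp_dim) + (mlp_dim * hidden_dim + hidden_dim)


# Toy block: 2 features -> 3 hidden units -> 2 features
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
b2 = [0.0, 0.0]
print(mlp_block([1.0, -1.0], w1, b1, w2, b2))

print(mlp_param_count(176, 384))  # → 135728
```

Under the bias assumption, one channel-mixing block at this model's dimensions (176 and 384) would hold 135,728 parameters.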

📈 Key Differences from 300K

| Metric | 300K | 500K | Change |
|---|---|---|---|
| Parameters | 331,680 | 557,328 | 1.7x |
| Hidden Dim | 128 | 176 | 1.4x |
| Channel MLP | 288 | 384 | 1.3x |
| Sequence Length | 128 | 128 | Same |
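
The "Change" column is the 500K value divided by the 300K value, rounded to one decimal place. A short check of that arithmetic, using only the numbers from the table (the dictionary and function names are illustrative):

```python
SIZES = {
    "Parameters": (331_680, 557_328),
    "Hidden Dim": (128, 176),
    "Channel MLP": (288, 384),
}


def growth_ratios(sizes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Return the 500K / 300K ratio for each metric."""
    return {name: new / old for name, (old, new) in sizes.items()}


for name, ratio in growth_ratios(SIZES).items():
    print(f"{name}: {ratio:.3f}x")
```

The parameter count grows faster than either dimension alone because weight matrices scale with the product of their input and output sizes.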

🎯 Generation Examples

Prompt: "Once upon a time"
Output: "Once upon a time there was a little boy named Sammy. Limmy love..."

Prompt: "The weather is"
Output: "The weather isy any ve vea laund ye Tha veaverd p shewave cand"

⚠️ Limitations

| Limitation | Description |
|---|---|
| Grammatical Errors | Fully grammatical sentences are still difficult to produce |
| Unstable Names | Names such as "Sammy" and "Lily" are generated inconsistently |
| Short Prompt Issues | Short prompts such as "Hello" produce near-random output |
| Overfitting | The model overfits to specific TinyStories phrases |

📊 Training Data

Dataset: TinyStories

  • Simple children's stories dataset
  • Lets the model learn basic grammar and vocabulary
  • Contains many patterns like "Once upon a time", "little girl/boy"

🔧 Usage

# Clone the repository first and run this script from its root,
# so that the `src` package is importable:
#   git clone https://github.com/llaa33219/MicroMixer-1.git
#   cd MicroMixer-1

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

config = MicroMixerV2Config(
    max_seq_len=128,
    hidden_dim=176,
    channel_mlp_dim=384,
    num_layers=3,
    use_hyper=True,
)

# Build the model, then load the pretrained weights from the Hugging Face Hub.
model = MicroMixerV2(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-1-500K-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt as bytes and add a batch dimension.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])

# Generate up to 64 new bytes without tracking gradients.
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)

print(tokenizer.decode(output[0].tolist()))
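
The `temperature` and `top_k` arguments conventionally control how the next byte is sampled from the model's output scores. The sketch below shows that conventional meaning in plain Python. It is not the repository's `generate` implementation, which may differ; `sample_top_k` is a hypothetical helper and the logits are made up.

```python
import math
import random
from typing import Optional


def sample_top_k(logits: list[float], temperature: float = 0.8, top_k: int = 40,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token id from the top_k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Keep the indices of the top_k largest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [logits[i] / temperature for i in keep]
    # Unnormalized softmax weights, shifted by the max for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(keep, weights=weights, k=1)[0]


logits = [0.1 * (i % 7) for i in range(256)]  # made-up scores over the 256 byte ids
next_id = sample_top_k(logits, temperature=0.8, top_k=40, rng=random.Random(0))
print(next_id)
```

With `top_k=1` this reduces to greedy decoding, which always picks the highest-scoring byte.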


Part of the MicroMixer-1 research project