MicroMixer-1-500K-TinyStories

Micro Language Model
Attention-Free • MLP-Only • Byte-Level

📋 Overview

MicroMixer-1-500K is a mid-sized model with 557,328 parameters (nominally 500K). At this scale, basic sentence structures such as "Once upon a time there was a little boy named Sammy" begin to emerge.

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[ImprovedMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter	Value
Total Parameters	`557,328`
Hidden Dimension	`176`
Channel MLP Dimension	`384`
Number of Layers	`3`
Max Sequence Length	`128`
Vocabulary Size	`256` (Byte-level)
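
A vocabulary of 256 suggests that each token ID is one UTF-8 byte. The sketch below illustrates that mapping under the assumption of plain UTF-8 encoding with no special tokens. The helper names `encode_bytes` and `decode_bytes` are hypothetical and are not the repository's `ByteTokenizer` API.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to byte-level token IDs in the range 0..255."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Map token IDs back to text; invalid UTF-8 becomes U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode_bytes("Once upon a time")
print(len(ids))           # 16 tokens: one per ASCII character
print(decode_bytes(ids))  # Once upon a time
```

Under this scheme a 128-token context holds at most 128 bytes of text, and non-ASCII characters consume two or more tokens each.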

Core Components

┌─────────────────────────────────────────────┐
│           ImprovedMixerLayer                 │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘
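
The diagram above describes a pre-norm residual layer. A minimal pure-Python sketch of that data flow follows, assuming a LayerNorm without learned scale or shift and treating the two mixing steps as pluggable functions. The names `layer_norm` and `mixer_layer` are illustrative, not the repository's code.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def mixer_layer(seq, token_mix, channel_mix):
    """Apply LayerNorm -> token mixing -> residual, then LayerNorm -> channel mixing -> residual."""
    # Token mixing sees the whole (normalized) sequence at once.
    mixed = token_mix([layer_norm(tok) for tok in seq])
    seq = [[a + b for a, b in zip(tok, m)] for tok, m in zip(seq, mixed)]
    # Channel mixing transforms each position independently.
    mixed = [channel_mix(layer_norm(tok)) for tok in seq]
    return [[a + b for a, b in zip(tok, m)] for tok, m in zip(seq, mixed)]
```

Because both branches are residual, a layer whose mixing functions return zeros passes its input through unchanged.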

1️⃣ RoPE (Rotary Position Embedding)

Encodes each position as a rotation of the feature vector
Intended to support extrapolation to sequences longer than those seen in training
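
As a rough illustration of the rotation idea, the sketch below applies the standard RoPE formulation to a single vector, rotating consecutive pairs of features by position-dependent angles. The base of 10000 and the pairing of adjacent features are assumptions taken from the common formulation. How this model applies RoPE to its embeddings is not specified here.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) feature pairs of vec by angles that grow with pos."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # lower-index pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The dot product depends only on the position offset (2 in both cases).
a = dot(rope(q, 3), rope(k, 1))
b = dot(rope(q, 12), rope(k, 10))
print(abs(a - b) < 1e-9)  # True
```

The offset-only property shown at the end is what makes rotary encodings relative rather than absolute.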

2️⃣ HyperMixing (Token Mixing)

Compresses past context via cumulative average pooling
Hypernetwork generates adaptive weights
Token mixing in O(S) time for sequence length S, without attention
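
The cumulative average pooling step can be sketched with a running sum, which is what keeps the cost linear in the sequence length. This is a simplified illustration only: the hypernetwork-generated weights are omitted, and including position t in its own average is an assumption. The function name `causal_cumulative_mean` is hypothetical.

```python
def causal_cumulative_mean(seq):
    """For each position t, return the mean of token vectors 0..t (inclusive)."""
    if not seq:
        return []
    running = [0.0] * len(seq[0])
    out = []
    for t, tok in enumerate(seq, start=1):
        # One vector addition per step, so the whole pass is O(S).
        running = [r + v for r, v in zip(running, tok)]
        out.append([r / t for r in running])
    return out


seq = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(causal_cumulative_mean(seq))
# [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
```

Each output position depends only on earlier inputs, so the summary stays causal and can be used for left-to-right generation.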

3️⃣ MlpBlock (Channel Mixing)

Non-linear transformation of feature dimensions
Structure: Linear → GELU → Linear
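
The Linear → GELU → Linear structure can be sketched for a single token vector as follows. The sketch assumes bias terms on both linear layers and the exact (erf-based) GELU, neither of which is confirmed here. The toy sizes stand in for the 176 → 384 → 176 shapes implied by the configuration table.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_block(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear for a single token vector."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes (2 -> 3 -> 2) for illustration only.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0]
print(mlp_block([0.0, 0.0], w1, b1, w2, b2))  # [0.0, 0.0] because GELU(0) = 0
```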

📈 Key Differences from 300K

Metric	300K	500K	Change
Parameters	331,680	557,328	1.7x
Hidden Dim	128	176	1.4x
Channel MLP	288	384	1.3x
Sequence Length	128	128	Same
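
The Change column can be reproduced from the two size columns with a short calculation:

```python
sizes = {
    "Parameters": (331_680, 557_328),
    "Hidden Dim": (128, 176),
    "Channel MLP": (288, 384),
}
# Ratio of the 500K value to the 300K value, rounded to one decimal place.
ratios = {name: round(new / old, 1) for name, (old, new) in sizes.items()}
print(ratios)
# {'Parameters': 1.7, 'Hidden Dim': 1.4, 'Channel MLP': 1.3}
```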

🎯 Generation Examples

Prompt: "Once upon a time"
Output: "Once upon a time there was a little boy named Sammy. Limmy love..."

Prompt: "The weather is"
Output: "The weather isy any ve vea laund ye Tha veaverd p shewave cand"

⚠️ Limitations

Limitation	Description
Grammatical Errors	Fully grammatical sentences are still difficult to produce
Unstable Names	Names such as "Sammy" and "Lily" are generated inconsistently
Short Prompt Issues	Short prompts such as "Hello" produce near-random output
Overfitting	Overfits to specific TinyStories phrases

📊 Training Data

Dataset: TinyStories

A dataset of simple children's stories
The model learns basic grammar and vocabulary from it
Contains many recurring patterns such as "Once upon a time" and "little girl/boy"

🔧 Usage

# Clone the repository first so that the `src` package is importable:
# git clone https://github.com/llaa33219/MicroMixer-1.git
# cd MicroMixer-1

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

config = MicroMixerV2Config(
    max_seq_len=128,
    hidden_dim=176,
    channel_mlp_dim=384,
    num_layers=3,
    use_hyper=True,
)

model = MicroMixerV2(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-1-500K-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)

print(tokenizer.decode(output[0].tolist()))
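
To illustrate what the `temperature` and `top_k` arguments typically control, here is a minimal sketch of temperature-scaled top-k sampling over one step of logits. It reflects the common technique and is an assumption about, not a copy of, what `model.generate` does internally. The function name `sample_top_k` is hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=None):
    """Sample a token ID from the top_k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Keep only the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights over the kept logits (subtract the max for numerical stability).
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


# With top_k=1 the sampler reduces to greedy decoding.
print(sample_top_k([0.1, 2.0, -1.0], top_k=1))  # 1
```

Lower temperatures sharpen the distribution toward the most likely byte, while a smaller `top_k` removes unlikely bytes from consideration entirely.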

Part of the MicroMixer-1 research project
