MicroMixer-1-100K-TinyStories

Micro Language Model
Attention-Free • MLP-Only • Byte-Level

📋 Overview

MicroMixer-1-100K is the smallest model in the MicroMixer series, with 136,908 (about 136K) parameters. It performs byte-level causal language modeling using only MLP layers, with no attention mechanism.

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[ImprovedMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter	Value
Total Parameters	`136,908`
Hidden Dimension	`84`
Channel MLP Dimension	`128`
Number of Layers	`3`
Max Sequence Length	`64`
Vocabulary Size	`256` (Byte-level)
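
A vocabulary of 256 corresponds to one token per possible byte value. The sketch below shows what byte-level encoding and decoding can look like. `SimpleByteTokenizer` is a hypothetical stand-in written for illustration; the repository's own `ByteTokenizer` may differ in details such as special tokens.

```python
class SimpleByteTokenizer:
    """Hypothetical byte-level tokenizer: token id == UTF-8 byte value."""

    vocab_size = 256  # one id per possible byte

    def encode(self, text: str) -> list[int]:
        # UTF-8 encode, then treat every byte as a token id in [0, 255].
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # Invalid byte sequences (possible in sampled output) are replaced
        # instead of raising an exception.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SimpleByteTokenizer()
ids = tok.encode("Once upon a time")
text = tok.decode(ids)
```

With this scheme an ASCII character costs one token, while non-ASCII characters cost several, so a 64-token window holds at most 64 characters of text.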

Core Components

┌─────────────────────────────────────────────┐
│           ImprovedMixerLayer                 │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘
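
Read as a pre-norm residual block, the diagram above can be sketched roughly as follows. This is an assumption-laden illustration, not the repository's `ImprovedMixerLayer`: the class name `MixerLayerSketch` is hypothetical, and the token and channel mixers are passed in as arbitrary modules.

```python
import torch
from torch import nn


class MixerLayerSketch(nn.Module):
    """Hypothetical pre-norm layer: x + mixer(LayerNorm(x)), applied twice."""

    def __init__(self, hidden_dim: int, token_mixer: nn.Module, channel_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.token_mixer = token_mixer      # mixes information across positions
        self.channel_mixer = channel_mixer  # mixes information across features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        x = x + self.token_mixer(self.norm1(x))    # token-mixing sub-block
        x = x + self.channel_mixer(self.norm2(x))  # channel-mixing sub-block
        return x


# Identity mixers are placeholders so the sketch runs on its own.
layer = MixerLayerSketch(84, nn.Identity(), nn.Identity())
x = torch.randn(2, 64, 84)
y = layer(x)
```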

1️⃣ RoPE (Rotary Position Embedding)

Encodes positions via rotation transformations
Intended to support length extrapolation beyond the training sequence length
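
For readers unfamiliar with RoPE, the following is a minimal sketch of the standard rotation applied to a `(batch, seq_len, dim)` tensor. The function names, the base of 10000, and the interleaved pairing of channels are assumptions taken from the common formulation; how and where MicroMixer applies the rotation may differ.

```python
import torch


def rope_tables(seq_len: int, dim: int, base: float = 10000.0):
    """Precompute cos/sin tables of shape (seq_len, dim // 2); dim must be even."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)  # (seq_len, dim // 2)
    return angles.cos(), angles.sin()


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate each channel pair (x[2i], x[2i+1]) by a position-dependent angle."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rot_even = x_even * cos - x_odd * sin
    rot_odd = x_even * sin + x_odd * cos
    # Re-interleave the rotated pairs back into the original channel order.
    return torch.stack((rot_even, rot_odd), dim=-1).flatten(-2)


cos, sin = rope_tables(seq_len=64, dim=84)
x = torch.randn(2, 64, 84)
y = apply_rope(x, cos, sin)
```

Because each pair is only rotated, the vector norm at every position is preserved, and position 0 (angle zero) is left unchanged.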

2️⃣ HyperMixing (Token Mixing)

Compresses past context via cumulative average pooling
Hypernetwork generates adaptive weights
O(S) complexity token mixing without attention
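
One way to read the three points above is sketched below: a causal running mean summarizes the prefix at each position, and a small hypernetwork turns that summary into per-channel weights. `HyperMixingSketch`, the `hyper_dim` width, and the sigmoid gating are all assumptions made for illustration; the repository's HyperMixing module may be structured differently.

```python
import torch
from torch import nn


class HyperMixingSketch(nn.Module):
    """Hypothetical token mixer: causal cumulative mean plus generated gates."""

    def __init__(self, hidden_dim: int, hyper_dim: int = 32):
        super().__init__()
        # Small hypernetwork that maps the context summary to channel weights.
        self.hyper = nn.Sequential(
            nn.Linear(hidden_dim, hyper_dim),
            nn.GELU(),
            nn.Linear(hyper_dim, hidden_dim),
        )
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        seq_len = x.shape[1]
        counts = torch.arange(1, seq_len + 1, device=x.device, dtype=x.dtype).view(1, seq_len, 1)
        # context[:, t] is the mean of x[:, 0..t]; a single cumsum is O(S).
        context = x.cumsum(dim=1) / counts
        weights = torch.sigmoid(self.hyper(context))  # adaptive, input-dependent
        return self.proj(weights * context)


mixer = HyperMixingSketch(hidden_dim=84)
x = torch.randn(2, 64, 84)
out = mixer(x)
```

Since position t only ever sees the mean of positions 0 through t, the sketch is causal: changing a later token leaves all earlier outputs untouched.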

3️⃣ MlpBlock (Channel Mixing)

Non-linear transformation of feature dimensions
Structure: Linear → GELU → Linear
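
The Linear → GELU → Linear structure can be written in a few lines of PyTorch. The class name `MlpBlockSketch` and the use of bias terms are assumptions; the dimensions come from the configuration table (hidden dimension 84, channel MLP dimension 128).

```python
import torch
from torch import nn


class MlpBlockSketch(nn.Module):
    """Hypothetical channel-mixing MLP: expand, apply GELU, project back."""

    def __init__(self, hidden_dim: int = 84, mlp_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, mlp_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(mlp_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied independently at every position; only features are mixed.
        return self.fc2(self.act(self.fc1(x)))


block = MlpBlockSketch()
x = torch.randn(2, 64, 84)
y = block(x)
# With biases: (84*128 + 128) + (128*84 + 84) = 21,716 parameters per block.
n_params = sum(p.numel() for p in block.parameters())
```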

⚠️ Limitations

Limitation	Description
Extremely Small	136K parameters cannot capture complex language patterns
Short Sequences	max_seq_len=64 limits context understanding
Grammatical Errors	Generated text is mostly ungrammatical
Repetitive Patterns	Repeats specific phrases from training data

📊 Training Data

Dataset: TinyStories

A dataset of simple children's stories
Lets the model learn basic grammar and vocabulary
Contains many recurring patterns such as "Once upon a time" and "little girl/boy"

🔧 Usage

# Clone the repository first so that the `src` package is importable:
#   git clone https://github.com/llaa33219/MicroMixer-1.git
#   cd MicroMixer-1

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

# Configuration matching the table in "Model Configuration".
config = MicroMixerV2Config(
    max_seq_len=64,
    hidden_dim=84,
    channel_mlp_dim=128,
    num_layers=3,
    use_hyper=True,
)

# Build the model and load the pretrained weights on CPU.
model = MicroMixerV2(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-1-100K-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt as bytes; the outer list adds a batch dimension of 1.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])

# Sample up to 64 new bytes without tracking gradients.
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)

print(tokenizer.decode(output[0].tolist()))
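
The `temperature` and `top_k` arguments passed to `generate` are conventionally applied to the next-token logits as sketched below. This illustrates the common technique only and is not taken from the repository; `sample_next_byte` is a hypothetical helper.

```python
import torch


def sample_next_byte(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40) -> torch.Tensor:
    """Sample one token id per row from (batch, vocab) logits."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    logits = logits / temperature
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        # Smallest logit that is still among the k largest, per row.
        kth = torch.topk(logits, k).values[..., -1, None]
        # Everything below that threshold gets zero probability.
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1)


torch.manual_seed(0)
logits = torch.arange(256, dtype=torch.float32).unsqueeze(0)  # (1, 256)
samples = [sample_next_byte(logits, temperature=0.8, top_k=5).item() for _ in range(50)]
```

With `top_k=5` and strictly increasing logits, every sample falls among the five highest byte ids (251 to 255).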

Part of the MicroMixer-1 research project
