MicroMixer-1-300K-TinyStories

Micro Language Model
Attention-Free • MLP-Only • Byte-Level

📋 Overview

MicroMixer-1-300K is a medium-sized model in the MicroMixer-1 series, with about 2.4x as many parameters as the 100K variant (331,680 vs. 136,908). It processes sequences of up to 128 byte-level tokens and begins to learn basic word patterns.

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[ImprovedMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter	Value
Total Parameters	`331,680`
Hidden Dimension	`128`
Channel MLP Dimension	`288`
Number of Layers	`3`
Max Sequence Length	`128`
Vocabulary Size	`256` (Byte-level)
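
A vocabulary of 256 means every token is one UTF-8 byte. The following is a minimal sketch of that idea, not the repository's `ByteTokenizer`; the function names `encode` and `decode` here are illustrative assumptions.

```python
def encode(text: str) -> list[int]:
    """Map text to its UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map byte values back to text, replacing invalid UTF-8 sequences."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode("Once upon a time")
print(len(ids))               # 16 ASCII characters give 16 tokens
print(len(encode("\u00e9")))  # one accented character gives 2 tokens
print(decode(ids))
```

Because non-ASCII characters span several bytes, they consume more of the 128-token window than ASCII text does.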

Core Components

┌─────────────────────────────────────────────┐
│           ImprovedMixerLayer                 │
│  ┌─────────────────────────────────────┐    │
│  │  LayerNorm → HyperMixing → Residual │    │ ← Token Mixing
│  ├─────────────────────────────────────┤    │
│  │  LayerNorm → MlpBlock → Residual    │    │ ← Channel Mixing
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

1️⃣ RoPE (Rotary Position Embedding)

Encodes positions via rotation transformations
Intended to help with sequences longer than those seen in training (length extrapolation)
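
The rotation idea can be sketched in plain Python. This is a generic RoPE illustration under assumptions: the interleaved pairing of dimensions, the base of 10000, and the helper names `rope` and `dot` are not taken from the MicroMixer-1 code, and how the model consumes the rotated vectors is not shown here.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The dot product depends only on the position offset (2 in both cases).
print(math.isclose(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10))))
```

The final line illustrates the property that makes rotary encodings relative: shifting both positions by the same amount leaves the dot product unchanged.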

2️⃣ HyperMixing (Token Mixing)

Compresses past context via cumulative average pooling
Hypernetwork generates adaptive weights
Mixes tokens in O(S) time for sequence length S, without attention
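
The cumulative average pooling step can be sketched as a running mean over past tokens. This covers only the pooling, assumes it is causal (position t sees tokens 0..t), and leaves out the hypernetwork entirely; `cumulative_mean` is a hypothetical name, not a function from the repository.

```python
def cumulative_mean(tokens: list[list[float]]) -> list[list[float]]:
    """For each position t, return the mean of tokens[0..t] in one pass."""
    if not tokens:
        return []
    running = [0.0] * len(tokens[0])
    out = []
    for count, tok in enumerate(tokens, start=1):
        # Update the running sum once per token, so the whole pass is O(S).
        running = [r + x for r, x in zip(running, tok)]
        out.append([r / count for r in running])
    return out


seq = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(cumulative_mean(seq))  # [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
```

Each position reuses the previous running sum rather than revisiting earlier tokens, which is where the linear cost comes from.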

3️⃣ MlpBlock (Channel Mixing)

Non-linear transformation of feature dimensions
Structure: Linear → GELU → Linear
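
A Linear → GELU → Linear block applied to one token's features can be sketched as follows. The weights are random placeholders rather than trained values, the bias terms and the exact (erf-based) GELU are assumptions, and the helper names are illustrative.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute W x + b for one feature vector; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_block(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear over the feature (channel) dimension."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden_dim, channel_mlp_dim = 128, 288  # values from the configuration table
w1, b1 = random_matrix(channel_mlp_dim, hidden_dim, rng), [0.0] * channel_mlp_dim
w2, b2 = random_matrix(hidden_dim, channel_mlp_dim, rng), [0.0] * hidden_dim

y = mlp_block([1.0] * hidden_dim, w1, b1, w2, b2)
print(len(y))  # 128: the block expands to 288 features, then projects back
```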

📈 Key Differences from 100K

Metric	100K	300K	Change
Parameters	136,908	331,680	2.4x
Hidden Dim	84	128	1.5x
Channel MLP	128	288	2.3x
Sequence Length	64	128	2x
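
The Change column can be recomputed directly from the two value columns; the snippet below uses only the numbers in the table.

```python
small = {"Parameters": 136_908, "Hidden Dim": 84, "Channel MLP": 128, "Sequence Length": 64}
large = {"Parameters": 331_680, "Hidden Dim": 128, "Channel MLP": 288, "Sequence Length": 128}

# Ratio of the 300K value to the 100K value for each metric.
ratios = {name: large[name] / small[name] for name in small}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

To two decimals the ratios are 2.42x, 1.52x, 2.25x, and 2.00x; the table rounds them to one decimal place.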

⚠️ Limitations

Limitation	Description
Limited Context	128 tokens (bytes) is still too short for complex context
Unstable Generation	Word patterns appear but sentence completion is poor
Vocabulary Gaps	Rare characters handled poorly despite byte-level encoding
Repetitive Output	Repeats patterns like "little girl named Timmy"

📊 Training Data

Dataset: TinyStories

Simple children's stories dataset
Learns basic grammar and vocabulary
Contains many patterns like "Once upon a time", "little girl/boy"

🔧 Usage

# Clone the repository first so that the `src` package is importable:
# git clone https://github.com/llaa33219/MicroMixer-1.git
# cd MicroMixer-1

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

config = MicroMixerV2Config(
    max_seq_len=128,
    hidden_dim=128,
    channel_mlp_dim=288,
    num_layers=3,
    use_hyper=True,
)

# Build the model, then download and load the pretrained weights on CPU.
model = MicroMixerV2(config)
weights_path = hf_hub_download("llaa33219/MicroMixer-1-300K-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt as bytes and add a batch dimension.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])

# Sample up to 64 new bytes without tracking gradients.
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)

print(tokenizer.decode(output[0].tolist()))
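
The `temperature` and `top_k` arguments follow the usual sampling conventions. Below is a generic sketch of temperature-scaled top-k sampling over one step's logits; it is an assumption about what such parameters typically do, not the repository's `generate` implementation, and `sample_top_k` is a hypothetical name.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one index from the `top_k` highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]  # one score per byte value
print(sample_top_k(logits, temperature=0.8, top_k=40, rng=rng))
```

Lower temperatures concentrate probability on the highest-scoring bytes, and `top_k=1` reduces to greedy decoding.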

Part of the MicroMixer-1 research project
