MicroMixer-1-100K-TinyStories

Micro Language Model
Attention-Free • MLP-Only • Byte-Level


📋 Overview

MicroMixer-1-100K is the smallest model in the MicroMixer series, with only 136K parameters. It performs causal language modeling using only MLP layers, without any attention mechanisms.


πŸ—οΈ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[ImprovedMixerLayer ×3]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff

Model Configuration

Parameter               Value
Total Parameters        136,908
Hidden Dimension        84
Channel MLP Dimension   128
Number of Layers        3
Max Sequence Length     64
Vocabulary Size         256 (Byte-level)
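
The vocabulary size of 256 follows from byte-level tokenization: every UTF-8 byte is its own token id. The sketch below illustrates that idea only. It assumes the repository's tokenizer behaves like a plain UTF-8 byte mapping, and the function names `encode_bytes` and `decode_bytes` are hypothetical, not the project's API.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to a list of byte values in the range 0..255."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Map byte ids back to text; invalid UTF-8 sequences are replaced."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode_bytes("Once upon a time")
print(len(ids))           # one id per byte
print(decode_bytes(ids))  # ASCII text round-trips unchanged
```

Non-ASCII characters occupy several bytes each, so they consume more of the 64-token context window than ASCII text does.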

Core Components

┌──────────────────────────────────────────────┐
│           ImprovedMixerLayer                 │
│  ┌─────────────────────────────────────┐     │
│  │  LayerNorm → HyperMixing → Residual │     │ ← Token Mixing
│  ├─────────────────────────────────────┤     │
│  │  LayerNorm → MlpBlock → Residual    │     │ ← Channel Mixing
│  └─────────────────────────────────────┘     │
└──────────────────────────────────────────────┘

1️⃣ RoPE (Rotary Position Embedding)

  • Encodes positions via rotation transformations
  • Enables length extrapolation beyond training sequences
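
To make the rotation idea concrete, here is a minimal sketch of rotary position embedding applied to a single vector. It assumes the common formulation (adjacent dimension pairs, frequency base 10000); the model's actual pairing scheme and base are not stated in this card, and the function name `rope` is hypothetical.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


v = [0.5, -1.0, 2.0, 0.25]
print(rope(v, pos=0))  # position 0 leaves the vector unchanged
print(rope(v, pos=3))  # same norm as v, different direction
```

Because each pair is only rotated, the vector norm is preserved, and the dot product of two rotated vectors depends only on the difference between their positions.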

2️⃣ HyperMixing (Token Mixing)

  • Compresses past context via cumulative average pooling
  • Hypernetwork generates adaptive weights
  • Token mixing in O(S) time for sequence length S, without attention
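
The cumulative average pooling step can be sketched as a running mean over positions, which is what keeps the cost linear in sequence length. This is an illustration under assumptions: it includes the current token in each average, and it omits the hypernetwork that generates the adaptive weights, since neither detail is specified in this card. The name `causal_cumulative_mean` is hypothetical.

```python
def causal_cumulative_mean(xs: list[list[float]]) -> list[list[float]]:
    """For each position t, return the mean of xs[0..t] (inclusive)."""
    if not xs:
        return []
    running = [0.0] * len(xs[0])
    out = []
    for t, x in enumerate(xs, start=1):
        # Update the running sum once per token: O(S) work overall.
        running = [r + v for r, v in zip(running, x)]
        out.append([r / t for r in running])
    return out


tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(causal_cumulative_mean(tokens))
```

Each output row depends only on the current and earlier tokens, so the operation is causal: changing a later token never alters an earlier summary.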

3️⃣ MlpBlock (Channel Mixing)

  • Non-linear transformation of feature dimensions
  • Structure: Linear → GELU → Linear
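
The Linear → GELU → Linear structure can be written out for a single token's feature vector as follows. This is a sketch, not the repository's code: it assumes both linear layers have biases and that the exact (erf-based) GELU is used, and the helper names are hypothetical. In the model itself the widths would be 84 → 128 → 84; the example uses toy sizes.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute weight @ x + bias; `weight` has one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_block(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear, applied to one token's feature vector."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes: 2 input features -> 3 hidden features -> 2 output features.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
b2 = [0.0, 0.5]
print(mlp_block([1.0, -1.0], w1, b1, w2, b2))
```

The block acts on each position independently, which is why it is described as channel mixing rather than token mixing.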

⚠️ Limitations

Limitation            Description
Extremely Small       136K parameters cannot capture complex language patterns
Short Sequences       max_seq_len=64 limits context understanding
Grammatical Errors    Generated text is mostly ungrammatical
Repetitive Patterns   Repeats specific phrases from training data

📊 Training Data

Dataset: TinyStories

  • Simple children's stories dataset
  • Learns basic grammar and vocabulary
  • Contains many patterns like "Once upon a time", "little girl/boy"

🔧 Usage

# Clone the repository first so that the `src` package is importable:
# git clone https://github.com/llaa33219/MicroMixer-1.git
# cd MicroMixer-1

import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

# Configuration matching the Model Configuration table
config = MicroMixerV2Config(
    max_seq_len=64,
    hidden_dim=84,
    channel_mlp_dim=128,
    num_layers=3,
    use_hyper=True,
)

# Build the model, then load the pretrained weights on CPU
model = MicroMixerV2(config)
weights_path = hf_hub_download(
    repo_id="llaa33219/MicroMixer-1-100K-TinyStories",
    filename="model.pt",
)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt as a batch of one byte sequence
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])

# Sample up to 64 new bytes without tracking gradients
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)

print(tokenizer.decode(output[0].tolist()))
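
The `temperature` and `top_k` arguments passed to `generate` control how each next byte is sampled. The sketch below shows the conventional meaning of those two parameters on a plain list of logits. It is an assumption that the repository's `generate` follows this convention, and `sample_top_k` is a hypothetical helper, not part of the project.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights over the kept logits; subtract the max for stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(logits, top_k=1))  # top_k=1 reduces to greedy decoding
print(sample_top_k(logits, temperature=0.8, top_k=2))
```

Lower temperatures concentrate probability on the highest logits, while a smaller `top_k` removes unlikely bytes from consideration entirely.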


Part of the MicroMixer-1 research project
