MicroMixer-1-100K is the smallest model in the MicroMixer series, with only 136K parameters. It performs causal language modeling using only MLP layers, without any attention mechanisms.
graph TD
A[Byte Input] --> B[Token Embedding]
B --> C[RoPE Position Encoding]
C --> D[ImprovedMixerLayer ×3]
D --> E[LayerNorm]
E --> F[LM Head]
F --> G[Byte Output]
style A fill:#007BFF,color:#fff
style G fill:#00D620,color:#fff
style D fill:#AE00FF,color:#fff
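Because the pipeline starts and ends with raw bytes, tokenization amounts to UTF-8 encoding and decoding. The sketch below illustrates that idea only: `SketchByteTokenizer` is a hypothetical stand-in, not the repository's `ByteTokenizer`, which may handle special tokens or invalid bytes differently.

```python
class SketchByteTokenizer:
    """Hypothetical byte-level tokenizer: one token id per UTF-8 byte (0..255)."""

    vocab_size = 256

    def encode(self, text: str) -> list:
        # Each UTF-8 byte becomes one token id.
        return list(text.encode("utf-8"))

    def decode(self, ids: list) -> str:
        # Generated bytes may not form valid UTF-8, so substitute a
        # replacement character instead of raising.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SketchByteTokenizer()
ids = tok.encode("Once upon a time")
print(len(ids))         # 16: ASCII text is one byte per character
print(tok.decode(ids))  # Once upon a time
```

Non-ASCII characters take several tokens each (for example, "é" encodes to two bytes), which eats into the 64-token context faster than plain English text does.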
| Parameter | Value |
|---|---|
| Total Parameters | 136,908 |
| Hidden Dimension | 84 |
| Channel MLP Dimension | 128 |
| Number of Layers | 3 |
| Max Sequence Length | 64 |
| Vocabulary Size | 256 (Byte-level) |
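
As a rough sanity check, the part of the parameter budget that follows directly from the listed dimensions can be tallied by hand. The sketch assumes an untied, bias-free LM head and channel MLPs with biases; this card confirms neither, so treat the breakdown as an estimate. Whatever remains would be spent on HyperMixing, LayerNorm, and any other components.

```python
hidden_dim = 84
channel_mlp_dim = 128
num_layers = 3
vocab_size = 256
total_params = 136_908  # total reported in the table

# Token embedding: one hidden_dim vector per byte value.
embedding = vocab_size * hidden_dim  # 21,504

# Assumption: the LM head is a separate hidden_dim -> vocab_size matrix, no bias.
lm_head = hidden_dim * vocab_size  # 21,504

# Assumption: each channel MLP is Linear(84 -> 128) + Linear(128 -> 84), with biases.
channel_mlp = (hidden_dim * channel_mlp_dim + channel_mlp_dim) + (
    channel_mlp_dim * hidden_dim + hidden_dim
)  # 21,716 per layer
channel_mlps = num_layers * channel_mlp  # 65,148

accounted = embedding + lm_head + channel_mlps
print(accounted)                 # 108156
print(total_params - accounted)  # 28752 left for token mixing, norms, etc.
```

Under these assumptions, roughly four fifths of the parameters sit in the embedding, the output head, and the channel MLPs.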
┌─────────────────────────────────────────────┐
│             ImprovedMixerLayer              │
│   ┌─────────────────────────────────────┐   │
│   │ LayerNorm → HyperMixing → Residual  │   │ ← Token Mixing
│   ├─────────────────────────────────────┤   │
│   │ LayerNorm → MlpBlock → Residual     │   │ ← Channel Mixing
│   └─────────────────────────────────────┘   │
└─────────────────────────────────────────────┘
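
The channel-mixing half of the layer (LayerNorm, then MlpBlock, then a residual add) can be sketched in plain Python for a single token vector. This is an illustration under assumptions, not the repository's code: it takes MlpBlock to be the Linear → GELU → Linear pattern this card lists, uses the exact (erf-based) GELU, includes biases, and omits LayerNorm's learned scale and shift. The HyperMixing token-mixing step is not shown because the card does not specify it.

```python
import math


def layer_norm(x, eps=1e-5):
    # Normalize one feature vector to zero mean and unit variance
    # (learned scale and shift omitted for brevity).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    # Exact GELU: v * Phi(v), where Phi is the standard normal CDF.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight, bias):
    # weight is laid out as [out_features][in_features].
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def channel_mixing(x, w1, b1, w2, b2):
    # LayerNorm -> Linear -> GELU -> Linear -> residual add.
    h = layer_norm(x)
    h = [gelu(v) for v in linear(h, w1, b1)]
    h = linear(h, w2, b2)
    return [xi + hi for xi, hi in zip(x, h)]


# Toy sizes; the real model uses hidden_dim=84 and channel_mlp_dim=128.
x = [1.0, 2.0, 3.0]
w1 = [[0.1, 0.0, -0.1], [0.0, 0.2, 0.0]]   # 3 -> 2
b1 = [0.0, 0.0]
w2 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 2 -> 3, all zeros
b2 = [0.0, 0.0, 0.0]
y = channel_mixing(x, w1, b1, w2, b2)
print(y)  # [1.0, 2.0, 3.0]: a zeroed output projection leaves only the residual path
```

The residual connection means each sublayer only has to learn a correction to its input, which helps keep even a three-layer stack easy to train.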
Linear → GELU → Linear

| Limitation | Description |
|---|---|
| Extremely Small | 136K parameters cannot capture complex language patterns |
| Short Sequences | max_seq_len=64 limits context understanding |
| Grammatical Errors | Generated text is mostly ungrammatical |
| Repetitive Patterns | Repeats specific phrases from training data |
Dataset: TinyStories (roneneldan/TinyStories)
# Clone the repository first so that `src` is importable:
#   git clone https://github.com/llaa33219/MicroMixer-1.git
#   cd MicroMixer-1
import torch
from huggingface_hub import hf_hub_download

from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

# Configuration matching the parameter table.
config = MicroMixerV2Config(
    max_seq_len=64,
    hidden_dim=84,
    channel_mlp_dim=128,
    num_layers=3,
    use_hyper=True,
)
model = MicroMixerV2(config)

# Download the checkpoint from the Hugging Face Hub and load it on CPU.
weights_path = hf_hub_download("llaa33219/MicroMixer-1-100K-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode the prompt as bytes and sample a continuation.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)
print(tokenizer.decode(output[0].tolist()))
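
The `temperature` and `top_k` arguments in the call above conventionally mean: keep the `top_k` highest-scoring tokens, divide their logits by `temperature`, apply softmax, and sample. The sketch below shows that common scheme using only the standard library. `sample_top_k` is a hypothetical helper, and whether `MicroMixerV2.generate` follows exactly these steps is an assumption.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Hypothetical helper: draw one token id from temperature-scaled top-k logits."""
    # Keep only the top_k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]

    # Numerically stable softmax numerators over the surviving candidates.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.0] * 256  # one logit per byte value
logits[ord("a")] = 5.0
logits[ord("b")] = 4.0
token = sample_top_k(logits, temperature=0.8, top_k=2, rng=rng)
print(chr(token))  # either "a" or "b": every other byte is filtered out
```

Lowering `temperature` or `top_k` makes output more deterministic, which for a model this small tends to trade variety for more repetition.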
Part of the MicroMixer-1 research project