MicroMixer-1-300K is a medium-sized model with about 2.4x as many parameters as the 100K variant (331,680 vs. 136,908). It processes sequences of up to 128 byte-level tokens and begins to learn basic word patterns.
graph TD
A[Byte Input] --> B[Token Embedding]
B --> C[RoPE Position Encoding]
C --> D[ImprovedMixerLayer ×3]
D --> E[LayerNorm]
E --> F[LM Head]
F --> G[Byte Output]
style A fill:#007BFF,color:#fff
style G fill:#00D620,color:#fff
style D fill:#AE00FF,color:#fff
| Parameter | Value |
|---|---|
| Total Parameters | 331,680 |
| Hidden Dimension | 128 |
| Channel MLP Dimension | 288 |
| Number of Layers | 3 |
| Max Sequence Length | 128 |
| Vocabulary Size | 256 (Byte-level) |
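A byte-level vocabulary of 256 means every token is one UTF-8 byte, so the 128-token limit is a 128-byte limit. The following is a minimal sketch of that idea; `SketchByteTokenizer` is a hypothetical stand-in written for illustration, and the repository's own `ByteTokenizer` may behave differently (for example around special tokens or truncation).

```python
from typing import List


class SketchByteTokenizer:
    """Hypothetical byte-level tokenizer: one token id per UTF-8 byte (0..255)."""

    vocab_size = 256

    def encode(self, text: str) -> List[int]:
        # Each UTF-8 byte becomes one token id.
        return list(text.encode("utf-8"))

    def decode(self, ids: List[int]) -> str:
        # errors="replace" keeps decoding from failing if a sequence was cut
        # in the middle of a multi-byte character.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SketchByteTokenizer()
print(tok.encode("Once"))      # ASCII: one byte per character
print(len(tok.encode("é")))    # non-ASCII characters take several bytes
print(tok.decode(tok.encode("Once upon a time")[:128]))  # fits the 128-token window
```

Under this scheme ASCII text costs one token per character, while accented or non-Latin characters consume two to four tokens each of the 128 available.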
┌─────────────────────────────────────────────┐
│             ImprovedMixerLayer              │
│  ┌───────────────────────────────────────┐  │
│  │ LayerNorm → HyperMixing → Residual    │  │ ← Token Mixing
│  ├───────────────────────────────────────┤  │
│  │ LayerNorm → MlpBlock → Residual       │  │ ← Channel Mixing
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
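The layer diagram above can be sketched in PyTorch as two pre-norm residual sub-blocks. This is only an illustration under stated assumptions: the internals of HyperMixing are not described here, so the sketch swaps in a plain causally masked linear mix over positions (`CausalLinearTokenMixer`, a hypothetical placeholder, not HyperMixing). All class names below are invented for the example; only the dimensions (128, 288, 128) and the `Linear → GELU → Linear` MLP shape come from this card.

```python
import torch
import torch.nn as nn


class MlpBlock(nn.Module):
    """Channel-mixing MLP: Linear -> GELU -> Linear, applied per position."""

    def __init__(self, hidden_dim: int, mlp_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, mlp_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(mlp_dim, hidden_dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


class CausalLinearTokenMixer(nn.Module):
    """Placeholder for HyperMixing (assumption): a learned linear mix across
    positions, masked so position t only sees positions <= t."""

    def __init__(self, max_seq_len: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_seq_len, max_seq_len) * 0.02)
        self.register_buffer("mask", torch.tril(torch.ones(max_seq_len, max_seq_len)))

    def forward(self, x):  # x: (batch, seq, hidden)
        seq = x.size(1)
        w = (self.weight * self.mask)[:seq, :seq]
        return w @ x  # (seq, seq) @ (batch, seq, hidden) -> (batch, seq, hidden)


class SketchMixerLayer(nn.Module):
    """LayerNorm -> token mixing -> residual, then LayerNorm -> MLP -> residual."""

    def __init__(self, hidden_dim: int = 128, mlp_dim: int = 288, max_seq_len: int = 128):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mix = CausalLinearTokenMixer(max_seq_len)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mix = MlpBlock(hidden_dim, mlp_dim)

    def forward(self, x):
        x = x + self.token_mix(self.norm1(x))    # token mixing
        x = x + self.channel_mix(self.norm2(x))  # channel mixing
        return x


layer = SketchMixerLayer()
out = layer(torch.randn(2, 16, 128))
print(out.shape)
```

The residual connections keep the input and output shapes identical, which is what allows three such layers to be stacked back to back.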
Linear → GELU → Linear

| Metric | 100K | 300K | Change |
|---|---|---|---|
| Parameters | 136,908 | 331,680 | 2.4x |
| Hidden Dim | 84 | 128 | 1.5x |
| Channel MLP | 128 | 288 | 2.3x |
| Sequence Length | 64 | 128 | 2x |
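The "Change" column can be checked with a few lines of arithmetic. The numbers below are copied from the table; the dictionary layout and variable names are just for this example.

```python
# (100K value, 300K value) pairs taken from the comparison table.
specs = {
    "Parameters": (136_908, 331_680),
    "Hidden Dim": (84, 128),
    "Channel MLP": (128, 288),
    "Sequence Length": (64, 128),
}

# Ratio of the 300K value to the 100K value for each metric.
ratios = {name: big / small for name, (small, big) in specs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The Channel MLP ratio is exactly 2.25, which the table reports rounded up to 2.3x.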
| Limitation | Description |
|---|---|
| Limited Context | 128 tokens (128 bytes) is still too short for complex context |
| Unstable Generation | Word patterns appear but sentence completion is poor |
| Vocabulary Gaps | Rare characters handled poorly despite byte-level encoding |
| Repetitive Output | Repeats patterns like "little girl named Timmy" |
Dataset: TinyStories (roneneldan/TinyStories)
# Clone the repository first so that the `src` package is importable:
#   git clone https://github.com/llaa33219/MicroMixer-1.git
#   cd MicroMixer-1
import torch
from huggingface_hub import hf_hub_download
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer

# Architecture settings must match the released 300K checkpoint.
config = MicroMixerV2Config(
    max_seq_len=128,
    hidden_dim=128,
    channel_mlp_dim=288,
    num_layers=3,
    use_hyper=True,
)
model = MicroMixerV2(config)

# Download the weights from the Hugging Face Hub and load them on CPU.
weights_path = hf_hub_download("llaa33219/MicroMixer-1-300K-TinyStories", "model.pt")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Encode a prompt as bytes and sample a continuation.
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)
print(tokenizer.decode(output[0].tolist()))
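The `generate` call above passes `temperature=0.8` and `top_k=40`. How the repository implements these is not shown in this card; the sketch below illustrates what the two parameters conventionally mean, using only the standard library. The function name `sample_top_k` and its signature are assumptions made for this example.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from raw logits with temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights over the survivors
    # (shifted by the maximum for numerical stability).
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy logits over a 256-entry byte vocabulary.
logits = [0.0] * 256
logits[ord("a")] = 5.0
logits[ord("e")] = 4.0
next_byte = sample_top_k(logits, temperature=0.8, top_k=40, rng=random.Random(0))
print(next_byte)
```

Temperatures below 1.0 sharpen the distribution toward the most likely bytes, while `top_k` removes the long tail entirely, which can help a very small model avoid emitting stray bytes.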
Part of the MicroMixer-1 research project