Monostich 100M
A Compact Instruction-Tuned Language Model
A from-scratch LLaMA-style language model pretrained on 16.6B tokens and instruction-tuned on multi-turn chat data
Overview
Monostich is a ~100M parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.
- Pretraining: ~16.6B tokens from FineWeb-Edu + Wikipedia
- SFT: Multi-turn instruction tuning on four mixed chat datasets with Llama-3-style chat templates
- Chat template: Llama-3 style —
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
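A sketch of rendering a conversation into this template is shown below; `build_prompt` is an illustrative helper, not a function shipped in inference.py.

```python
# Render a chat history into the Llama-3-style template shown above.
# build_prompt is illustrative and not part of the released inference.py.
def build_prompt(messages):
    """messages: list of {"role": "user" | "assistant", "content": str} dicts."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(msg["content"])
        parts.append("<|eot_id|>")
    # Open an assistant header so generation continues as the assistant's reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(build_prompt([{"role": "user", "content": "Hello"}]))
```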
Model Architecture
Pipeline: Chat Prompt → BPE-32K Tokenizer → LLaMA Decoder (12L) → Token Prediction
Decoder Block (×12)
Each transformer layer contains:
- Grouped Query Attention with RoPE positional embeddings (12 Q heads, 4 KV heads)
- SwiGLU MLP with gated activation (768 → 2048 → 768)
- RMSNorm pre-attention and pre-MLP
- SDPA backend (Flash Attention when available)
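A condensed sketch of one such layer, using the dimensions above (768 hidden, 12 Q / 4 KV heads of size 64, 2048-wide MLP). Module layout, names, and the rotate-half RoPE variant are illustrative assumptions rather than the released implementation; `nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); cos/sin: (seq, head_dim), rotate-half style
    x1, x2 = x.chunk(2, dim=-1)
    return x * cos + torch.cat((-x2, x1), dim=-1) * sin

class DecoderBlock(nn.Module):
    def __init__(self, dim=768, n_heads=12, n_kv_heads=4, head_dim=64, mlp_dim=2048):
        super().__init__()
        self.n_heads, self.n_kv_heads, self.head_dim = n_heads, n_kv_heads, head_dim
        self.attn_norm = nn.RMSNorm(dim)                      # pre-attention norm
        self.mlp_norm = nn.RMSNorm(dim)                       # pre-MLP norm
        self.q_proj = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, dim, bias=False)
        self.gate_proj = nn.Linear(dim, mlp_dim, bias=False)  # SwiGLU gate
        self.up_proj = nn.Linear(dim, mlp_dim, bias=False)
        self.down_proj = nn.Linear(mlp_dim, dim, bias=False)

    def forward(self, x, cos, sin):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.q_proj(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        # GQA: each of the 4 KV heads serves 3 query heads (12 / 4 = 3).
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        # SDPA dispatches to Flash Attention when the backend supports it.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.mlp_norm(x)
        return x + self.down_proj(F.silu(self.gate_proj(h)) * self.up_proj(h))

# Toy usage with RoPE θ = 10,000 and a 16-token sequence.
head_dim, t = 64, 16
inv_freq = 1.0 / (10_000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.outer(torch.arange(t).float(), inv_freq)       # (t, head_dim/2)
cos, sin = map(lambda a: torch.cat((a, a), dim=-1), (angles.cos(), angles.sin()))
out = DecoderBlock()(torch.randn(1, t, 768), cos, sin)
```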
Technical Specifications
| Parameter | Value |
|---|---|
| Architecture | LLaMA-style Decoder-Only Transformer |
| Parameters | 100,092,672 (~100M) |
| Hidden Dimension | 768 |
| Intermediate (MLP) | 2,048 |
| Layers | 12 |
| Attention Heads | 12 (Q) / 4 (KV) — GQA 3:1 |
| Head Dimension | 64 |
| Context Length | 1024 |
| RoPE θ | 10,000 |
| Vocabulary | 32,000 (BPE) |
| Tied Embeddings | Yes |
| Precision | bfloat16 |
| Weight Size | ~191 MiB (bf16) |
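As a sanity check, the parameter count above can be re-derived from these dimensions. The derivation assumes no bias terms, tied input/output embeddings, and a final RMSNorm, consistent with the design choices listed below.

```python
# Re-derive the 100,092,672 figure from the specification table.
vocab, dim, layers, mlp = 32_000, 768, 12, 2_048
n_heads, n_kv_heads, head_dim = 12, 4, 64

embed = vocab * dim                             # token embeddings, shared with LM head
attn = (dim * n_heads * head_dim                # Q projection
        + 2 * dim * n_kv_heads * head_dim       # K and V projections (GQA)
        + n_heads * head_dim * dim)             # output projection
swiglu = 3 * dim * mlp                          # gate, up, down projections
norms = 2 * dim                                 # pre-attention + pre-MLP RMSNorm

total = embed + layers * (attn + swiglu + norms) + dim  # + final RMSNorm (assumed)
print(total)  # 100092672
```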
Design Choices
| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary Positional Embeddings for relative position encoding | LLaMA |
| GQA | Grouped Query Attention (3:1) for efficient KV cache | LLaMA-2 |
| SwiGLU | Gated linear unit with SiLU activation | PaLM, LLaMA |
| RMSNorm | Root Mean Square normalization (faster than LayerNorm) | LLaMA |
| Flash Attention | Memory-efficient attention via PyTorch SDPA | Dao et al. |
| Weight Tying | Embedding and LM head share weights | Standard |
Tokenizer
| Property | Value |
|---|---|
| Type | Byte-Pair Encoding (BPE) |
| Vocabulary | 32,000 tokens |
| Library | HuggingFace tokenizers |
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| <\|pad\|> | 0 | Padding |
| <\|unk\|> | 1 | Unknown |
| <\|begin_of_text\|> | 2 | Beginning of text |
| <\|end_of_text\|> | 3 | End of text (document boundary) |
| <\|start_header_id\|> | 4 | Chat role header open |
| <\|end_header_id\|> | 5 | Chat role header close |
| <\|eot_id\|> | 6 | End of turn (generation stop token) |
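These IDs can be checked against the released tokenizer.json with the HuggingFace tokenizers API (a sketch; it assumes the special tokens are registered as added tokens so they map to single IDs).

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download tokenizer.json from the repo and look up a few special-token IDs.
tok = Tokenizer.from_file(hf_hub_download("kerzgrr/monostich", "tokenizer.json"))
print(tok.token_to_id("<|eot_id|>"))         # expected: 6
print(tok.token_to_id("<|begin_of_text|>"))  # expected: 2
```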
Training Details
Phase 1: Pretraining
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu + Wikipedia |
| Tokens | ~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia) |
| Context Length | 1024 |
| Objective | Next-token prediction (all tokens) |
| Peak LR | 3 × 10⁻⁴ |
| Min LR | 3 × 10⁻⁵ |
| Warmup | 200 steps |
| Schedule | Warmup → Plateau (10%) → Cosine Decay |
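The plateau is stated only as a percentage; the sketch below reads it as 10% of the post-warmup steps, which is one plausible interpretation. The function name and `total_steps` are illustrative, not taken from the training code.

```python
import math

def pretrain_lr(step, total_steps, warmup=200, peak=3e-4, floor=3e-5, plateau_frac=0.10):
    """Warmup -> plateau at peak -> cosine decay to floor (shape only; illustrative)."""
    if step < warmup:                                  # linear warmup over 200 steps
        return peak * (step + 1) / warmup
    plateau_end = warmup + int(plateau_frac * (total_steps - warmup))
    if step < plateau_end:                             # hold at the peak LR
        return peak
    progress = (step - plateau_end) / max(1, total_steps - plateau_end)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```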
Phase 2: Supervised Fine-Tuning (SFT)
| Setting | Value |
|---|---|
| Datasets | Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation |
| Context Length | 1024 |
| Objective | Masked cross-entropy (assistant tokens only) |
| Chat Template | Llama-3 style with header tokens |
| Peak LR | 5 × 10⁻⁵ |
| Min LR | 5 × 10⁻⁶ |
| Warmup | 100 steps |
| Schedule | Warmup → Cosine Decay |
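A sketch of that objective: label positions outside assistant spans are set to the usual ignore_index of -100, so only assistant tokens contribute to the loss. Names and the mask representation are illustrative.

```python
import torch.nn.functional as F

def sft_loss(logits, input_ids, assistant_mask):
    """logits: (B, T, V); input_ids: (B, T); assistant_mask: (B, T) bool, True on assistant tokens."""
    labels = input_ids.clone()
    labels[~assistant_mask] = -100                     # ignore user/system/template tokens
    # Standard causal shift: position t predicts token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```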
Shared Training Config
| Setting | Value |
|---|---|
| Optimizer | AdamW (fused) — β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ |
| Weight Decay | 0.0 |
| Gradient Clipping | 1.0 (global norm) |
| Precision | bfloat16 autocast |
| Compilation | Optional torch.compile (max-autotune) |
| Multi-GPU | Automatic DDP when ≥2 GPUs detected |
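The same settings expressed as a PyTorch call. The small Linear module stands in for the full model, and the fused AdamW kernel is enabled only when CUDA parameters are available; everything else mirrors the table above.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(768, 768, bias=False).to(device)   # placeholder for the real model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                     # peak LR (pretraining); 5e-5 for SFT
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.0,
    fused=(device == "cuda"),    # fused kernel requires CUDA tensors
)

# bfloat16 autocast for the forward pass, then clip the global grad norm at 1.0.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(torch.randn(4, 768, device=device)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```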
SFT Datasets
| Dataset | Source | Notes |
|---|---|---|
| Kyoto-Corpus | Nikity/Kyoto-Corpus | Multi-turn instruction pairs |
| LMSYS-Chat-1M | lmsys/lmsys-chat-1m | Real-world conversations (redacted rows skipped) |
| Nomi-150M-Chat | guus4324343/Nomi-150M-Chat | Synthetic chat data |
| Chat-Compilation | aklein4/chat-compilation | Multi-source compilation (system-prompt conversations excluded) |
Quick Start
Installation
pip install torch safetensors tokenizers huggingface_hub
Run
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).
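If you prefer to fetch the artifacts yourself, the same files can be pulled with huggingface_hub and loaded directly. This is only a sketch of roughly what the script does; inference.py still handles model construction and generation.

```python
import json
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tokenizers import Tokenizer

repo = "kerzgrr/monostich"
state_dict = load_file(hf_hub_download(repo, "monostich.safetensors"))  # bf16 tensors
config = json.load(open(hf_hub_download(repo, "config.json")))
tokenizer = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
```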
Usage
Interactive chat (default):
python inference.py
Single prompt:
python inference.py --prompt "What is the capital of France?"
Options:
| Flag | Default | Description |
|---|---|---|
| --prompt | None | Single prompt (omit for interactive REPL) |
| --temperature | 0.28 | Sampling temperature |
| --top-p | 0.95 | Nucleus sampling threshold |
| --max-new-tokens | context max | Max tokens to generate |
| --device | cuda | Device (cuda or cpu) |
| --seed | 1234 | Random seed |
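For context on what --temperature and --top-p control, here is a standalone nucleus-sampling step over a 1-D logits vector. It is illustrative only; inference.py implements its own sampling loop.

```python
import torch

def sample_next_token(logits, temperature=0.28, top_p=0.95):
    """logits: 1-D tensor over the vocabulary for the last position."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the cumulative probability before them already exceeds top_p.
    sorted_probs[cumulative - sorted_probs >= top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, 1)]

next_id = sample_next_token(torch.randn(32_000))   # returns a single token id
```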
Model Family
| Model | Parameters | Context | Status |
|---|---|---|---|
| Monostich | ~100M | 1024 | Available |
| Couplet | ~200M | 1024 | Training |
Limitations
- Scale: At 100M parameters this model is a research prototype, not a production system
File Contents
kerzgrr/monostich/
README.md # This model card
inference.py # Standalone inference script
monostich.safetensors # Weights (bfloat16, SafeTensors)
config.json # Model architecture config
tokenizer.json # BPE tokenizer (HuggingFace format)
tokenizer_config.json # Tokenizer metadata
special_token_ids.json # Token ID mapping
special_tokens_map.json # Token string mapping
Citation
@misc{monostich2026,
title={Monostich: A Compact Instruction-Tuned Language Model},
year={2026},
url={https://huggingface.co/kerzgrr/monostich}
}
Acknowledgments
Built on:
- LLaMA architecture (Meta AI)
- FineWeb-Edu dataset (HuggingFace)
- Wikipedia dataset (Wikimedia)
- Kyoto-Corpus (Nikity)
- LMSYS-Chat-1M (LMSYS)
- Nomi-150M-Chat (guus4324343)
- Chat-Compilation (aklein4)
- PyTorch SDPA / Flash Attention
- HuggingFace tokenizers and hub
A monostich is a poem of a single line — small, but complete.