---
license: apache-2.0
language:
- en
tags:
- text-generation
- causal-lm
- llama
- transformer
- pytorch
- sft
- instruction-tuned
- flash-attention
- gguf-compatible
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
- Nikity/Kyoto-Corpus
- lmsys/lmsys-chat-1m
- guus4324343/Nomi-150M-Chat
- aklein4/chat-compilation
model-index:
- name: Monostich-100M
results: []
---
# Monostich 100M
### A Compact Instruction-Tuned Language Model
*A from-scratch LLaMA-style language model pretrained on 16.6B tokens and instruction-tuned on multi-turn chat data*
---
## Overview
**Monostich** is a ~100M parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.
- **Pretraining**: ~16.6B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) + [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- **SFT**: Multi-turn instruction tuning on 5 mixed datasets with Llama-3-style chat templates
- **Chat template**: Llama-3 style — `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`
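The template above can be assembled with plain string formatting. This is an illustrative sketch (the helper name and message format are not from the repo), but the output matches the template string shown above exactly:

```python
# Sketch: assembling a Llama-3-style chat prompt as described above.
# build_prompt and the message dict format are illustrative, not from inference.py.
def build_prompt(messages):
    """messages: list of {'role': ..., 'content': ...} dicts."""
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Open the assistant header so the model generates the reply next.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(build_prompt([{"role": "user", "content": "Hello"}]))
```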
---
## Model Architecture
**Pipeline:** `Chat Prompt` → `BPE-32K Tokenizer` → `LLaMA Decoder (12L)` → `Token Prediction`
### Decoder Block (×12)
Each transformer layer contains:
- **Grouped Query Attention** with RoPE positional embeddings (12 Q heads, 4 KV heads)
- **SwiGLU MLP** with gated activation (768 → 2048 → 768)
- **RMSNorm** pre-attention and pre-MLP
- **SDPA** backend (Flash Attention when available)
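The 3:1 GQA ratio means each group of three query heads reads from one shared K/V head, shrinking the KV cache to a third of full multi-head attention. A minimal sketch of the head-to-head mapping (the grouping-by-consecutive-heads convention is the usual one, assumed here):

```python
# Sketch: how 12 query heads share 4 KV heads under 3:1 GQA.
N_Q_HEADS, N_KV_HEADS = 12, 4
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 3 query heads per KV head

def kv_head_for(q_head):
    # Consecutive Q heads share a KV head (the common GQA layout).
    return q_head // GROUP_SIZE

print([kv_head_for(q) for q in range(N_Q_HEADS)])
# → [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```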
### Technical Specifications
| Spec | Value |
|------|-------|
| Architecture | LLaMA-style Decoder-Only Transformer |
| Parameters | 100,092,672 (~100M) |
| Hidden Dimension | 768 |
| Intermediate (MLP) | 2,048 |
| Layers | 12 |
| Attention Heads | 12 (Q) / 4 (KV) — GQA 3:1 |
| Head Dimension | 64 |
| Context Length | 1024 |
| RoPE θ | 10,000 |
| Vocabulary | 32,000 (BPE) |
| Tied Embeddings | Yes |
| Precision | bfloat16 |
| Weight Size | ~191 MiB (bf16) |
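The parameter count follows directly from the dimensions above. A quick arithmetic check, assuming no bias terms and tied input/output embeddings (both standard for this architecture):

```python
# Sketch: reproducing the 100,092,672 parameter count from the spec table.
d, d_mlp, layers, vocab = 768, 2048, 12, 32000
n_q, n_kv, head_dim = 12, 4, 64

embed = vocab * d                   # tied with the LM head, so counted once
attn  = d * (n_q * head_dim)        # Q projection
attn += 2 * d * (n_kv * head_dim)   # K and V projections (GQA: 4 heads each)
attn += (n_q * head_dim) * d        # output projection
mlp   = 3 * d * d_mlp               # gate, up, down (SwiGLU)
norms = 2 * d                       # pre-attention + pre-MLP RMSNorm
total = embed + layers * (attn + mlp + norms) + d  # + final RMSNorm

print(total)  # → 100092672
```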
### Design Choices
| Feature | Description | Origin |
|---------|-------------|--------|
| RoPE | Rotary Positional Embeddings for relative position encoding | LLaMA |
| GQA | Grouped Query Attention (3:1) for efficient KV cache | LLaMA-2 |
| SwiGLU | Gated linear unit with SiLU activation | PaLM, LLaMA |
| RMSNorm | Root Mean Square normalization (faster than LayerNorm) | LLaMA |
| Flash Attention | Memory-efficient attention via PyTorch SDPA | Dao et al. |
| Weight Tying | Embedding and LM head share weights | Standard |
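RMSNorm and SwiGLU are simple enough to show in scalar form. A minimal, illustrative sketch (plain Python lists instead of tensors):

```python
import math

def rmsnorm(xs, weight, eps=1e-6):
    # Scale by the root mean square; no mean subtraction, unlike LayerNorm.
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [w * x / rms for w, x in zip(weight, xs)]

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up):
    # Elementwise gated activation: SiLU(gate) * up, as in the MLP above.
    return [silu(g) * u for g, u in zip(gate, up)]

print(rmsnorm([3.0, 4.0], [1.0, 1.0]))
```

In the actual MLP, `gate` and `up` are two separate 768 → 2048 projections of the same input, and the gated product is projected back down 2048 → 768.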
---
## Tokenizer
| Property | Value |
|----------|-------|
| Type | Byte-Pair Encoding (BPE) |
| Vocabulary | 32,000 tokens |
| Library | HuggingFace tokenizers |
### Special Tokens
| Token | ID | Purpose |
|-------|----|---------|
| `<\|pad\|>` | 0 | Padding |
| `<\|unk\|>` | 1 | Unknown |
| `<\|begin_of_text\|>` | 2 | Beginning of text |
| `<\|end_of_text\|>` | 3 | End of text (document boundary) |
| `<\|start_header_id\|>` | 4 | Chat role header open |
| `<\|end_header_id\|>` | 5 | Chat role header close |
| `<\|eot_id\|>` | 6 | End of turn (generation stop token) |
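Combined with the chat template, these IDs form the control-token skeleton of every turn. An illustrative sketch (content tokens elided):

```python
# Sketch: the control-token skeleton of one chat turn, using the ID table above.
SPECIAL = {"<|pad|>": 0, "<|unk|>": 1, "<|begin_of_text|>": 2, "<|end_of_text|>": 3,
           "<|start_header_id|>": 4, "<|end_header_id|>": 5, "<|eot_id|>": 6}

# A user turn followed by an open assistant header; role/content tokens elided:
skeleton = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>",
            "<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]
print([SPECIAL[t] for t in skeleton])  # → [2, 4, 5, 6, 4, 5]

# Generation should stop when the model emits ID 6 (<|eot_id|>).
STOP_ID = SPECIAL["<|eot_id|>"]
```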
---
## Training Details
### Phase 1: Pretraining
| Setting | Value |
|---------|-------|
| Dataset | FineWeb-Edu + Wikipedia |
| Tokens | ~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia) |
| Context Length | 1024 |
| Objective | Next-token prediction (all tokens) |
| Peak LR | 3 × 10⁻⁴ |
| Min LR | 3 × 10⁻⁵ |
| Warmup | 200 steps |
| Schedule | Warmup → Plateau (10%) → Cosine Decay |
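The warmup → plateau → cosine schedule can be sketched as a function of the step index. This is illustrative; the interpretation of the plateau as ~10% of total steps after warmup is an assumption, not confirmed by the repo:

```python
import math

# Sketch of the pretraining LR schedule: linear warmup, a plateau at peak LR
# (assumed here to span ~10% of total steps), then cosine decay to the min LR.
def lr_at(step, total_steps, peak=3e-4, minimum=3e-5, warmup=200, plateau_frac=0.10):
    plateau_end = warmup + int(plateau_frac * total_steps)
    if step < warmup:
        return peak * (step + 1) / warmup          # linear warmup
    if step < plateau_end:
        return peak                                # hold at peak
    # Cosine decay from peak to minimum over the remaining steps.
    progress = (step - plateau_end) / max(1, total_steps - plateau_end)
    return minimum + 0.5 * (peak - minimum) * (1.0 + math.cos(math.pi * progress))

print(lr_at(0, 10_000), lr_at(500, 10_000), lr_at(9_999, 10_000))
```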
### Phase 2: Supervised Fine-Tuning (SFT)
| Setting | Value |
|---------|-------|
| Datasets | Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation |
| Context Length | 1024 |
| Objective | Masked cross-entropy (assistant tokens only) |
| Chat Template | Llama-3 style with header tokens |
| Peak LR | 5 × 10⁻⁵ |
| Min LR | 5 × 10⁻⁶ |
| Warmup | 100 steps |
| Schedule | Warmup → Cosine Decay |
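"Assistant tokens only" means non-assistant positions are masked out of the loss, conventionally by setting their labels to the cross-entropy `ignore_index` of −100. A minimal sketch (the role annotations are illustrative, not the actual training code):

```python
# Sketch: building SFT labels so loss is computed on assistant tokens only.
IGNORE = -100  # conventional ignore_index for cross-entropy

def make_labels(token_ids, roles):
    """roles[i] says who 'owns' token i; only assistant tokens keep a label."""
    return [tid if role == "assistant" else IGNORE
            for tid, role in zip(token_ids, roles)]

ids   = [2, 4, 5, 10, 11, 6, 4, 5, 20, 21, 6]
roles = ["ctx"] * 8 + ["assistant"] * 3   # assistant answer = last 3 tokens
print(make_labels(ids, roles))
# → [-100, -100, -100, -100, -100, -100, -100, -100, 20, 21, 6]
```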
### Shared Training Config
| Setting | Value |
|---------|-------|
| Optimizer | AdamW (fused), β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ |
| Weight Decay | 0.0 |
| Gradient Clipping | 1.0 (global norm) |
| Precision | bfloat16 autocast |
| Compilation | Optional torch.compile (max-autotune) |
| Multi-GPU | Automatic DDP when ≥2 GPUs detected |
---
## Quick Start
### Installation
```bash
pip install torch safetensors tokenizers huggingface_hub
```
### Run
```bash
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
```
The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).
### Usage
**Interactive chat** (default):
```bash
python inference.py
```
**Single prompt**:
```bash
python inference.py --prompt "What is the capital of France?"
```
**Options:**
| Flag | Default | Description |
|------|---------|-------------|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (`cuda` or `cpu`) |
| `--seed` | 1234 | Random seed |
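Nucleus (top-p) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold, then renormalizes. An illustrative sketch over a toy distribution (not the actual `inference.py` implementation):

```python
# Sketch: top-p (nucleus) filtering of a next-token distribution.
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest high-probability set with cumulative mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized distribution

print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

With `top_p=0.9` the lowest-probability token (0.05) is dropped, since the first three already cover 0.95 of the mass; the remaining three are renormalized to sum to 1.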
---
## Model Family
| Model | Parameters | Context | Status |
|-------|------------|---------|--------|
| Monostich | ~100M | 1024 | Available |
| Couplet | ~200M | 1024 | Training |
---
## Limitations
- **Scale**: At 100M parameters, this model is a research prototype, not a production system
---
## File Contents
```
kerzgrr/monostich/
README.md # This model card
inference.py # Standalone inference script
monostich.safetensors # Weights (bfloat16, SafeTensors)
config.json # Model architecture config
tokenizer.json # BPE tokenizer (HuggingFace format)
tokenizer_config.json # Tokenizer metadata
special_token_ids.json # Token ID mapping
special_tokens_map.json # Token string mapping
```
---
## Citation
```bibtex
@misc{monostich2026,
  title={Monostich: A Compact Instruction-Tuned Language Model},
  year={2026},
  url={https://huggingface.co/kerzgrr/monostich}
}
```
---
## Acknowledgments
Built on:
- **LLaMA** architecture (Meta AI)
- **FineWeb-Edu** dataset (HuggingFace)
- **Wikipedia** dataset (Wikimedia)
- **Kyoto-Corpus** (Nikity)
- **LMSYS-Chat-1M** (LMSYS)
- **Nomi-150M-Chat** (guus4324343)
- **Chat-Compilation** (aklein4)
- **PyTorch** SDPA / Flash Attention
- **HuggingFace** tokenizers and hub
---
*A monostich is a poem of a single line — small, but complete.*