---
license: apache-2.0
language:
- en
tags:
- text-generation
- causal-lm
- llama
- transformer
- pytorch
- sft
- instruction-tuned
- flash-attention
- gguf-compatible
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
- Nikity/Kyoto-Corpus
- lmsys/lmsys-chat-1m
- guus4324343/Nomi-150M-Chat
- aklein4/chat-compilation
model-index:
- name: Monostich-100M
  results: []
---

<div align="center">

# Monostich 100M

### A Compact Instruction-Tuned Language Model

*A from-scratch LLaMA-style language model pretrained on ~16.6B tokens and instruction-tuned on multi-turn chat data*

</div>

---

## Overview

**Monostich** is a ~100M-parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.

- **Pretraining**: ~16.6B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) + [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- **SFT**: Multi-turn instruction tuning on four mixed chat datasets with a Llama-3-style chat template
- **Chat template**: Llama-3 style — `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n` (see the sketch below)
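
For clarity, this is how a single-turn prompt is assembled under that template. The `format_prompt` helper below is only an illustration; the actual prompt construction lives in `inference.py`:

```python
# Illustrative helper for the Llama-3-style template described above.
def format_prompt(messages):
    """Render a list of {"role", "content"} dicts into the chat format."""
    text = "<|begin_of_text|>"
    for msg in messages:
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        text += f"{msg['content']}<|eot_id|>"
    # Open the assistant turn so the model generates the reply next.
    text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

print(format_prompt([{"role": "user", "content": "Hello"}]))
# Matches the template string shown above.
```
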
---

## Model Architecture

**Pipeline:** `Chat Prompt` → `BPE-32K Tokenizer` → `LLaMA Decoder (12L)` → `Token Prediction`

### Decoder Block (×12)

Each transformer layer contains:

- **Grouped Query Attention** with RoPE positional embeddings (12 Q heads, 4 KV heads)
- **SwiGLU MLP** with gated activation (768 → 2048 → 768)
- **RMSNorm** pre-attention and pre-MLP
- **SDPA** backend (Flash Attention when available); a shape-level sketch follows this list
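
A shape-level sketch of the GQA + SDPA path, with tensor sizes taken from the spec table below. This is illustrative only (RoPE application is omitted) and is not the model's actual code:

```python
import torch
import torch.nn.functional as F

# Illustrative GQA shapes: 12 query heads, 4 KV heads, head_dim 64.
B, T, n_q, n_kv, hd = 2, 16, 12, 4, 64

q = torch.randn(B, n_q, T, hd)    # queries: one set per query head
k = torch.randn(B, n_kv, T, hd)   # keys: only 4 KV heads (smaller KV cache)
v = torch.randn(B, n_kv, T, hd)

# Expand each KV head to serve 3 query heads (12 / 4 = 3).
k = k.repeat_interleave(n_q // n_kv, dim=1)
v = v.repeat_interleave(n_q // n_kv, dim=1)

# SDPA dispatches to Flash Attention when the backend supports it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 12, 16, 64])
```
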
### Technical Specifications

<table>
<tr><td><b>Architecture</b></td><td>LLaMA-style Decoder-Only Transformer</td></tr>
<tr><td><b>Parameters</b></td><td>100,092,672 (~100M)</td></tr>
<tr><td><b>Hidden Dimension</b></td><td>768</td></tr>
<tr><td><b>Intermediate (MLP)</b></td><td>2,048</td></tr>
<tr><td><b>Layers</b></td><td>12</td></tr>
<tr><td><b>Attention Heads</b></td><td>12 (Q) / 4 (KV) — GQA 3:1</td></tr>
<tr><td><b>Head Dimension</b></td><td>64</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>RoPE θ</b></td><td>10,000</td></tr>
<tr><td><b>Vocabulary</b></td><td>32,000 (BPE)</td></tr>
<tr><td><b>Tied Embeddings</b></td><td>Yes</td></tr>
<tr><td><b>Precision</b></td><td>bfloat16</td></tr>
<tr><td><b>Weight Size</b></td><td>~191 MiB (bf16)</td></tr>
</table>
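
The parameter count in the table can be reproduced from these numbers, assuming bias-free linear projections as in standard LLaMA blocks:

```python
hidden, inter, layers, vocab = 768, 2048, 12, 32000
n_q, n_kv, head_dim = 12, 4, 64

embed = vocab * hidden                       # 24,576,000 (tied with the LM head)
attn = hidden * (n_q * head_dim)             # Q projection
attn += 2 * hidden * (n_kv * head_dim)       # K and V projections
attn += (n_q * head_dim) * hidden            # output projection
mlp = 3 * hidden * inter                     # gate, up, down of SwiGLU
norms = 2 * hidden                           # pre-attention + pre-MLP RMSNorm
per_layer = attn + mlp + norms               # 6,292,992
total = embed + layers * per_layer + hidden  # + final RMSNorm
print(total)                                 # 100,092,672
```
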
### Design Choices

<table>
<tr><th>Feature</th><th>Description</th><th>Origin</th></tr>
<tr><td><b>RoPE</b></td><td>Rotary Positional Embeddings for relative position encoding</td><td>LLaMA</td></tr>
<tr><td><b>GQA</b></td><td>Grouped Query Attention (3:1) for efficient KV cache</td><td>LLaMA-2</td></tr>
<tr><td><b>SwiGLU</b></td><td>Gated linear unit with SiLU activation</td><td>PaLM, LLaMA</td></tr>
<tr><td><b>RMSNorm</b></td><td>Root Mean Square normalization (faster than LayerNorm)</td><td>LLaMA</td></tr>
<tr><td><b>Flash Attention</b></td><td>Memory-efficient attention via PyTorch SDPA</td><td>Dao et al.</td></tr>
<tr><td><b>Weight Tying</b></td><td>Embedding and LM head share weights</td><td>Standard</td></tr>
</table>
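
For reference, minimal PyTorch versions of the RMSNorm and SwiGLU components above. This is a generic sketch, not the repository's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated MLP: down(SiLU(gate(x)) * up(x)); here 768 -> 2048 -> 768."""
    def __init__(self, dim=768, hidden=2048):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```
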
---

## Tokenizer

<table>
<tr><td><b>Type</b></td><td>Byte-Pair Encoding (BPE)</td></tr>
<tr><td><b>Vocabulary</b></td><td>32,000 tokens</td></tr>
<tr><td><b>Library</b></td><td>HuggingFace <code>tokenizers</code></td></tr>
</table>

### Special Tokens

<table>
<tr><th>Token</th><th>ID</th><th>Purpose</th></tr>
<tr><td><code>&lt;|pad|&gt;</code></td><td>0</td><td>Padding</td></tr>
<tr><td><code>&lt;|unk|&gt;</code></td><td>1</td><td>Unknown</td></tr>
<tr><td><code>&lt;|begin_of_text|&gt;</code></td><td>2</td><td>Beginning of text</td></tr>
<tr><td><code>&lt;|end_of_text|&gt;</code></td><td>3</td><td>End of text (document boundary)</td></tr>
<tr><td><code>&lt;|start_header_id|&gt;</code></td><td>4</td><td>Chat role header open</td></tr>
<tr><td><code>&lt;|end_header_id|&gt;</code></td><td>5</td><td>Chat role header close</td></tr>
<tr><td><code>&lt;|eot_id|&gt;</code></td><td>6</td><td>End of turn (generation stop token)</td></tr>
</table>
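
The IDs above can be checked directly with the HuggingFace `tokenizers` library. A short sketch, assuming `tokenizer.json` has already been downloaded from this repo:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Print the ID assigned to each chat-control token.
for name in ["<|begin_of_text|>", "<|start_header_id|>",
             "<|end_header_id|>", "<|eot_id|>", "<|end_of_text|>"]:
    print(name, tok.token_to_id(name))

ids = tok.encode("Hello, world!").ids
print(ids, tok.decode(ids))
```
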
---

## Training Details

### Phase 1: Pretraining

<table>
<tr><td><b>Dataset</b></td><td><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">FineWeb-Edu</a> + <a href="https://huggingface.co/datasets/wikimedia/wikipedia">Wikipedia</a></td></tr>
<tr><td><b>Tokens</b></td><td>~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia)</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>Objective</b></td><td>Next-token prediction (all tokens)</td></tr>
<tr><td><b>Peak LR</b></td><td>3 × 10<sup>-4</sup></td></tr>
<tr><td><b>Min LR</b></td><td>3 × 10<sup>-5</sup></td></tr>
<tr><td><b>Warmup</b></td><td>200 steps</td></tr>
<tr><td><b>Schedule</b></td><td>Warmup → Plateau (10%) → Cosine Decay</td></tr>
</table>
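
The warmup → plateau → cosine shape can be written as a function of the step index. The sketch below takes only the peak/min LR and warmup length from the table; the total step count and the exact meaning of the 10% plateau are assumptions:

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP = 200
TOTAL_STEPS = 10_000                 # placeholder: actual step count not published
PLATEAU = int(0.10 * TOTAL_STEPS)    # assumed: plateau is 10% of total steps

def lr_at(step):
    if step < WARMUP:                # linear warmup up to the peak LR
        return PEAK_LR * (step + 1) / WARMUP
    if step < WARMUP + PLATEAU:      # hold at the peak
        return PEAK_LR
    # cosine decay from the peak down to the minimum LR
    progress = (step - WARMUP - PLATEAU) / max(1, TOTAL_STEPS - WARMUP - PLATEAU)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```
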
### Phase 2: Supervised Fine-Tuning (SFT)

<table>
<tr><td><b>Datasets</b></td><td>Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>Objective</b></td><td>Masked cross-entropy (assistant tokens only)</td></tr>
<tr><td><b>Chat Template</b></td><td>Llama-3 style with header tokens</td></tr>
<tr><td><b>Peak LR</b></td><td>5 × 10<sup>-5</sup></td></tr>
<tr><td><b>Min LR</b></td><td>5 × 10<sup>-6</sup></td></tr>
<tr><td><b>Warmup</b></td><td>100 steps</td></tr>
<tr><td><b>Schedule</b></td><td>Warmup → Cosine Decay</td></tr>
</table>
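
"Assistant tokens only" is the usual label-masking setup: every non-assistant position gets the ignore index so it contributes nothing to the loss. A minimal sketch (the `-100` ignore index is PyTorch's `cross_entropy` convention, not something specific to this repo):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100

def masked_labels(input_ids, assistant_mask):
    """Copy input_ids, but hide every non-assistant position from the loss."""
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels

def sft_loss(logits, labels):
    # logits: (B, T, vocab); labels: (B, T) with -100 on prompt/user tokens.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from t
        labels[:, 1:].reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```
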
### Shared Training Config

<table>
<tr><td><b>Optimizer</b></td><td>AdamW (fused) — β<sub>1</sub>=0.9, β<sub>2</sub>=0.95, ε=10<sup>-8</sup></td></tr>
<tr><td><b>Weight Decay</b></td><td>0.0</td></tr>
<tr><td><b>Gradient Clipping</b></td><td>1.0 (global norm)</td></tr>
<tr><td><b>Precision</b></td><td>bfloat16 autocast</td></tr>
<tr><td><b>Compilation</b></td><td>Optional <code>torch.compile</code> (max-autotune)</td></tr>
<tr><td><b>Multi-GPU</b></td><td>Automatic DDP when ≥2 GPUs detected</td></tr>
</table>
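
Taken together, these settings correspond roughly to the training-step skeleton below. The model here is a stand-in `nn.Linear` so the snippet runs; the real module comes from this repo:

```python
import torch
import torch.nn as nn

# Placeholder model so the skeleton is runnable on a CUDA machine.
model = nn.Linear(768, 32000, bias=False).cuda()

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4, betas=(0.9, 0.95), eps=1e-8,
    weight_decay=0.0, fused=True,
)
# Optional: model = torch.compile(model, mode="max-autotune")

def train_step(x, y):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = nn.functional.cross_entropy(logits, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip global norm at 1.0
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```
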
### SFT Datasets

<table>
<tr><th>Dataset</th><th>Source</th><th>Notes</th></tr>
<tr><td><b>Kyoto-Corpus</b></td><td><a href="https://huggingface.co/datasets/Nikity/Kyoto-Corpus">Nikity/Kyoto-Corpus</a></td><td>Multi-turn instruction pairs</td></tr>
<tr><td><b>LMSYS-Chat-1M</b></td><td><a href="https://huggingface.co/datasets/lmsys/lmsys-chat-1m">lmsys/lmsys-chat-1m</a></td><td>Real-world conversations (redacted rows skipped)</td></tr>
<tr><td><b>Nomi-150M-Chat</b></td><td><a href="https://huggingface.co/datasets/guus4324343/Nomi-150M-Chat">guus4324343/Nomi-150M-Chat</a></td><td>Synthetic chat data</td></tr>
<tr><td><b>Chat-Compilation</b></td><td><a href="https://huggingface.co/datasets/aklein4/chat-compilation">aklein4/chat-compilation</a></td><td>Multi-source compilation (system-prompt conversations excluded)</td></tr>
</table>

---

## Quick Start

### Installation

```bash
pip install torch safetensors tokenizers huggingface_hub
```

### Run

```bash
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
```

The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).

### Usage

**Interactive chat** (default):

```bash
python inference.py
```

**Single prompt**:

```bash
python inference.py --prompt "What is the capital of France?"
```

**Options:**

| Flag | Default | Description |
|------|---------|-------------|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (`cuda` or `cpu`) |
| `--seed` | 1234 | Random seed |
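
`--temperature` and `--top-p` are standard temperature scaling plus nucleus sampling. A self-contained sketch of that sampling step (not the code in `inference.py`):

```python
import torch

def sample_next_token(logits, temperature=0.28, top_p=0.95):
    """Pick the next token id from a (vocab,) logits vector."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose mass reaches top_p (always keep the top one).
    keep = cumulative - sorted_probs < top_p
    keep[0] = True
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()

next_id = sample_next_token(torch.randn(32000))
```
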
---

## Model Family

<table>
<tr><th>Model</th><th>Parameters</th><th>Context</th><th>Status</th></tr>
<tr><td><b>Monostich</b></td><td>~100M</td><td>1024</td><td>Available</td></tr>
<tr><td><b>Couplet</b></td><td>~200M</td><td>1024</td><td>Training</td></tr>
</table>

---

## Limitations

- **Scale**: At 100M parameters this model is a research prototype, not a production system

---

## File Contents

```
kerzgrr/monostich/
  README.md                 # This model card
  inference.py              # Standalone inference script
  monostich.safetensors     # Weights (bfloat16, SafeTensors)
  config.json               # Model architecture config
  tokenizer.json            # BPE tokenizer (HuggingFace format)
  tokenizer_config.json     # Tokenizer metadata
  special_token_ids.json    # Token ID mapping
  special_tokens_map.json   # Token string mapping
```
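
These files can also be fetched and inspected without the helper script, e.g. with `huggingface_hub` and `safetensors` (inspection only; building and running the model is what `inference.py` does):

```python
import json
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo = "kerzgrr/monostich"
weights_path = hf_hub_download(repo, "monostich.safetensors")
config_path = hf_hub_download(repo, "config.json")

state_dict = load_file(weights_path)   # dict of bfloat16 tensors
config = json.load(open(config_path))

print(sum(t.numel() for t in state_dict.values()))  # stored parameter count
print(config)
```
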
---

## Citation

```bibtex
@misc{monostich2026,
  title={Monostich: A Compact Instruction-Tuned Language Model},
  year={2026},
  url={https://huggingface.co/kerzgrr/monostich}
}
```

---

## Acknowledgments

Built on:

- **LLaMA** architecture (Meta AI)
- **FineWeb-Edu** dataset (HuggingFace)
- **Wikipedia** dataset (Wikimedia)
- **Kyoto-Corpus** (Nikity)
- **LMSYS-Chat-1M** (LMSYS)
- **Nomi-150M-Chat** (guus4324343)
- **Chat-Compilation** (aklein4)
- **PyTorch** SDPA / Flash Attention
- **HuggingFace** tokenizers and hub

---

<div align="center">

*A monostich is a poem of a single line — small, but complete.*

</div>