---
license: apache-2.0
language:
- en
tags:
- text-generation
- causal-lm
- llama
- transformer
- pytorch
- sft
- instruction-tuned
- flash-attention
- gguf-compatible
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
- Nikity/Kyoto-Corpus
- lmsys/lmsys-chat-1m
- guus4324343/Nomi-150M-Chat
- aklein4/chat-compilation
model-index:
- name: Monostich-100M
results: []
---
<div align="center">
# Monostich 100M
### A Compact Instruction-Tuned Language Model
[![Model](https://img.shields.io/badge/Model-100M_params-blue)](.)
[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org)
[![GGUF](https://img.shields.io/badge/GGUF-Compatible-orange.svg)](https://github.com/ggerganov/llama.cpp)
*A from-scratch LLaMA-style language model pretrained on 16.6B tokens and instruction-tuned on multi-turn chat data*
</div>
---
## Overview
**Monostich** is a ~100M parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.
- **Pretraining**: ~16.6B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) + [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- **SFT**: Multi-turn instruction tuning on four mixed chat datasets with Llama-3-style chat templates
- **Chat template**: Llama-3 style &mdash; `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`
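The template above can be reproduced with a small helper. This is an illustrative sketch; `inference.py` contains the canonical implementation:

```python
def build_prompt(messages):
    """Assemble a Llama-3-style prompt; `messages` is a list of (role, content) pairs."""
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    # Open the assistant turn so the model generates the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(build_prompt([("user", "Hello")]))
```

Generation should stop when the model emits `<|eot_id|>`.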
---
## Model Architecture
**Pipeline:** `Chat Prompt` &rarr; `BPE-32K Tokenizer` &rarr; `LLaMA Decoder (12L)` &rarr; `Token Prediction`
### Decoder Block (&times;12)
Each transformer layer contains:
- **Grouped Query Attention** with RoPE positional embeddings (12 Q heads, 4 KV heads)
- **SwiGLU MLP** with gated activation (768 &rarr; 2048 &rarr; 768)
- **RMSNorm** pre-attention and pre-MLP
- **SDPA** backend (Flash Attention when available)
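As a sketch, the GQA sub-block can be written around PyTorch's SDPA as follows. RoPE application is omitted for brevity, and the projection shapes are inferred from the spec table below, so treat this as illustrative rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def gqa_attention(x, wq, wk, wv, wo, n_q_heads=12, n_kv_heads=4, head_dim=64):
    """Grouped Query Attention: 12 query heads share 4 KV heads (3:1 ratio)."""
    B, T, _ = x.shape
    q = (x @ wq).view(B, T, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    # Expand each KV head to serve its group of 3 query heads.
    k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    # SDPA dispatches to Flash Attention when available.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, n_q_heads * head_dim) @ wo
```

The 4 KV heads are what shrink the KV cache relative to full multi-head attention.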
### Technical Specifications
<table>
<tr><td><b>Architecture</b></td><td>LLaMA-style Decoder-Only Transformer</td></tr>
<tr><td><b>Parameters</b></td><td>100,092,672 (~100M)</td></tr>
<tr><td><b>Hidden Dimension</b></td><td>768</td></tr>
<tr><td><b>Intermediate (MLP)</b></td><td>2,048</td></tr>
<tr><td><b>Layers</b></td><td>12</td></tr>
<tr><td><b>Attention Heads</b></td><td>12 (Q) / 4 (KV) &mdash; GQA 3:1</td></tr>
<tr><td><b>Head Dimension</b></td><td>64</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>RoPE &theta;</b></td><td>10,000</td></tr>
<tr><td><b>Vocabulary</b></td><td>32,000 (BPE)</td></tr>
<tr><td><b>Tied Embeddings</b></td><td>Yes</td></tr>
<tr><td><b>Precision</b></td><td>bfloat16</td></tr>
<tr><td><b>Weight Size</b></td><td>~191 MiB (bf16)</td></tr>
</table>
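The stated parameter count can be recomputed from the table as a consistency check. The breakdown below assumes bias-free projections and a final RMSNorm, which is consistent with the stated total:

```python
# Recompute the ~100M parameter count from the spec table.
d, ffn, layers, vocab = 768, 2048, 12, 32000
n_q, n_kv, head_dim = 12, 4, 64

attn = d * (n_q * head_dim) + 2 * d * (n_kv * head_dim) + (n_q * head_dim) * d
mlp = 3 * d * ffn            # gate, up, and down projections of SwiGLU
norms = 2 * d                # pre-attention + pre-MLP RMSNorm
per_layer = attn + mlp + norms

total = vocab * d + layers * per_layer + d  # tied embeddings + final norm
print(f"{total:,}")  # 100,092,672
```

Weight tying means the 24.6M embedding parameters are counted once, not twice.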
### Design Choices
<table>
<tr><th>Feature</th><th>Description</th><th>Origin</th></tr>
<tr><td><b>RoPE</b></td><td>Rotary Positional Embeddings for relative position encoding</td><td>LLaMA</td></tr>
<tr><td><b>GQA</b></td><td>Grouped Query Attention (3:1) for efficient KV cache</td><td>LLaMA-2</td></tr>
<tr><td><b>SwiGLU</b></td><td>Gated linear unit with SiLU activation</td><td>PaLM, LLaMA</td></tr>
<tr><td><b>RMSNorm</b></td><td>Root Mean Square normalization (faster than LayerNorm)</td><td>LLaMA</td></tr>
<tr><td><b>Flash Attention</b></td><td>Memory-efficient attention via PyTorch SDPA</td><td>Dao et al.</td></tr>
<tr><td><b>Weight Tying</b></td><td>Embedding and LM head share weights</td><td>Standard</td></tr>
</table>
---
## Tokenizer
<table>
<tr><td><b>Type</b></td><td>Byte-Pair Encoding (BPE)</td></tr>
<tr><td><b>Vocabulary</b></td><td>32,000 tokens</td></tr>
<tr><td><b>Library</b></td><td>HuggingFace <code>tokenizers</code></td></tr>
</table>
### Special Tokens
<table>
<tr><th>Token</th><th>ID</th><th>Purpose</th></tr>
<tr><td><code>&lt;|pad|&gt;</code></td><td>0</td><td>Padding</td></tr>
<tr><td><code>&lt;|unk|&gt;</code></td><td>1</td><td>Unknown</td></tr>
<tr><td><code>&lt;|begin_of_text|&gt;</code></td><td>2</td><td>Beginning of text</td></tr>
<tr><td><code>&lt;|end_of_text|&gt;</code></td><td>3</td><td>End of text (document boundary)</td></tr>
<tr><td><code>&lt;|start_header_id|&gt;</code></td><td>4</td><td>Chat role header open</td></tr>
<tr><td><code>&lt;|end_header_id|&gt;</code></td><td>5</td><td>Chat role header close</td></tr>
<tr><td><code>&lt;|eot_id|&gt;</code></td><td>6</td><td>End of turn (generation stop token)</td></tr>
</table>
---
## Training Details
### Phase 1: Pretraining
<table>
<tr><td><b>Dataset</b></td><td><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">FineWeb-Edu</a> + <a href="https://huggingface.co/datasets/wikimedia/wikipedia">Wikipedia</a></td></tr>
<tr><td><b>Tokens</b></td><td>~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia)</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>Objective</b></td><td>Next-token prediction (all tokens)</td></tr>
<tr><td><b>Peak LR</b></td><td>3 &times; 10<sup>-4</sup></td></tr>
<tr><td><b>Min LR</b></td><td>3 &times; 10<sup>-5</sup></td></tr>
<tr><td><b>Warmup</b></td><td>200 steps</td></tr>
<tr><td><b>Schedule</b></td><td>Warmup &rarr; Plateau (10%) &rarr; Cosine Decay</td></tr>
</table>
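The warmup &rarr; plateau &rarr; cosine schedule can be sketched as a step-to-LR function. The exact plateau semantics (10% of total steps held at peak LR) are an assumption consistent with the table:

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup=200, plateau_frac=0.10):
    """Pretraining LR schedule: linear warmup, flat plateau, cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup           # linear warmup
    plateau_end = warmup + int(plateau_frac * total_steps)
    if step < plateau_end:
        return peak_lr                           # hold at peak
    progress = (step - plateau_end) / max(1, total_steps - plateau_end)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The SFT schedule is the same shape with `plateau_frac=0.0`, `peak_lr=5e-5`, `min_lr=5e-6`, and `warmup=100`.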
### Phase 2: Supervised Fine-Tuning (SFT)
<table>
<tr><td><b>Datasets</b></td><td>Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>Objective</b></td><td>Masked cross-entropy (assistant tokens only)</td></tr>
<tr><td><b>Chat Template</b></td><td>Llama-3 style with header tokens</td></tr>
<tr><td><b>Peak LR</b></td><td>5 &times; 10<sup>-5</sup></td></tr>
<tr><td><b>Min LR</b></td><td>5 &times; 10<sup>-6</sup></td></tr>
<tr><td><b>Warmup</b></td><td>100 steps</td></tr>
<tr><td><b>Schedule</b></td><td>Warmup &rarr; Cosine Decay</td></tr>
</table>
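The "assistant tokens only" objective is typically implemented by masking the labels so the loss ignores prompt tokens; a minimal sketch, where the `-100` sentinel matches the default `ignore_index` of PyTorch's `cross_entropy`:

```python
IGNORE_INDEX = -100  # default ignore_index for torch.nn.functional.cross_entropy

def mask_labels(token_ids, assistant_mask):
    """Replace non-assistant positions with IGNORE_INDEX so they carry no loss.

    The usual one-token shift between inputs and labels is handled elsewhere.
    """
    return [tok if is_asst else IGNORE_INDEX
            for tok, is_asst in zip(token_ids, assistant_mask)]

print(mask_labels([4, 9, 5, 120, 6], [False, False, True, True, True]))
```

User turns, headers, and padding all fall under the mask; only assistant completions contribute gradient.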
### Shared Training Config
<table>
<tr><td><b>Optimizer</b></td><td>AdamW (fused) &mdash; &beta;<sub>1</sub>=0.9, &beta;<sub>2</sub>=0.95, &epsilon;=10<sup>-8</sup></td></tr>
<tr><td><b>Weight Decay</b></td><td>0.0</td></tr>
<tr><td><b>Gradient Clipping</b></td><td>1.0 (global norm)</td></tr>
<tr><td><b>Precision</b></td><td>bfloat16 autocast</td></tr>
<tr><td><b>Compilation</b></td><td>Optional <code>torch.compile</code> (max-autotune)</td></tr>
<tr><td><b>Multi-GPU</b></td><td>Automatic DDP when &ge;2 GPUs detected</td></tr>
</table>
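The shared optimizer setup can be reconstructed on a toy module as follows. `fused=True` (as used in training) is CUDA-only, so it is gated here, and the toy model and loss are placeholders:

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the real model
opt = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                       # peak LR for pretraining (5e-5 for SFT)
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.0,
    fused=torch.cuda.is_available(),
)

x = torch.randn(4, 768)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
with torch.autocast(device_type, dtype=torch.bfloat16):  # bf16 autocast
    loss = model(x).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global-norm clipping
opt.step()
opt.zero_grad()
```

With &beta;<sub>2</sub>=0.95 and zero weight decay, this mirrors common LLaMA-style pretraining recipes.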
### SFT Datasets
<table>
<tr><th>Dataset</th><th>Source</th><th>Notes</th></tr>
<tr><td><b>Kyoto-Corpus</b></td><td><a href="https://huggingface.co/datasets/Nikity/Kyoto-Corpus">Nikity/Kyoto-Corpus</a></td><td>Multi-turn instruction pairs</td></tr>
<tr><td><b>LMSYS-Chat-1M</b></td><td><a href="https://huggingface.co/datasets/lmsys/lmsys-chat-1m">lmsys/lmsys-chat-1m</a></td><td>Real-world conversations (redacted rows skipped)</td></tr>
<tr><td><b>Nomi-150M-Chat</b></td><td><a href="https://huggingface.co/datasets/guus4324343/Nomi-150M-Chat">guus4324343/Nomi-150M-Chat</a></td><td>Synthetic chat data</td></tr>
<tr><td><b>Chat-Compilation</b></td><td><a href="https://huggingface.co/datasets/aklein4/chat-compilation">aklein4/chat-compilation</a></td><td>Multi-source compilation (system-prompt conversations excluded)</td></tr>
</table>
---
## Quick Start
### Installation
```bash
pip install torch safetensors tokenizers huggingface_hub
```
### Run
```bash
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
```
The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).
### Usage
**Interactive chat** (default):
```bash
python inference.py
```
**Single prompt**:
```bash
python inference.py --prompt "What is the capital of France?"
```
**Options:**
| Flag | Default | Description |
|------|---------|-------------|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (`cuda` or `cpu`) |
| `--seed` | 1234 | Random seed |
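The `--temperature` and `--top-p` flags correspond to temperature scaling followed by nucleus sampling. A plain-Python sketch of that decoding step (not the code in `inference.py`):

```python
import math
import random

def sample_top_p(logits, temperature=0.28, top_p=0.95, rng=random):
    """Temperature scaling, then sample from the smallest set with mass >= top_p."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for p, i in ranked:                        # accumulate the nucleus
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    r = rng.random() * cum                     # renormalize over the kept set
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

The low default temperature (0.28) biases a model this small toward its most confident continuations.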
---
## Model Family
<table>
<tr><th>Model</th><th>Parameters</th><th>Context</th><th>Status</th></tr>
<tr><td><b>Monostich</b></td><td>~100M</td><td>1024</td><td>Available</td></tr>
<tr><td><b>Couplet</b></td><td>~200M</td><td>1024</td><td>Training</td></tr>
</table>
---
## Limitations
- **Scale**: At ~100M parameters, this model is a research prototype, not a production system; expect factual errors and shallow reasoning, especially outside its training distribution
---
## File Contents
```
kerzgrr/monostich/
README.md # This model card
inference.py # Standalone inference script
monostich.safetensors # Weights (bfloat16, SafeTensors)
config.json # Model architecture config
tokenizer.json # BPE tokenizer (HuggingFace format)
tokenizer_config.json # Tokenizer metadata
special_token_ids.json # Token ID mapping
special_tokens_map.json # Token string mapping
```
---
## Citation
```bibtex
@misc{monostich2026,
title={Monostich: A Compact Instruction-Tuned Language Model},
year={2026},
url={https://huggingface.co/kerzgrr/monostich}
}
```
---
## Acknowledgments
Built on:
- **LLaMA** architecture (Meta AI)
- **FineWeb-Edu** dataset (HuggingFace)
- **Wikipedia** dataset (Wikimedia)
- **Kyoto-Corpus** (Nikity)
- **LMSYS-Chat-1M** (LMSYS)
- **Nomi-150M-Chat** (guus4324343)
- **Chat-Compilation** (aklein4)
- **PyTorch** SDPA / Flash Attention
- **HuggingFace** tokenizers and hub
---
<div align="center">
*A monostich is a poem of a single line &mdash; small, but complete.*
</div>