---
license: apache-2.0
language:
- en
tags:
- text-generation
- causal-lm
- llama
- transformer
- pytorch
- sft
- instruction-tuned
- flash-attention
- gguf-compatible
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
- Nikity/Kyoto-Corpus
- lmsys/lmsys-chat-1m
- guus4324343/Nomi-150M-Chat
- aklein4/chat-compilation
model-index:
- name: Monostich-100M
results: []
---
# Monostich 100M
### A Compact Instruction-Tuned Language Model
*A from-scratch LLaMA-style language model pretrained on 16.6B tokens and instruction-tuned on multi-turn chat data*
---
## Overview
**Monostich** is a ~100M parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.
- **Pretraining**: ~16.6B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) + [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- **SFT**: Multi-turn instruction tuning on 5 mixed datasets with Llama-3-style chat templates
- **Chat template**: Llama-3 style — `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`
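The template above can be assembled with plain string formatting. This is an illustrative sketch (the helper name and message format are not from the repo), but the output matches the template string shown above exactly:

```python
# Sketch: assembling a Llama-3-style chat prompt as described above.
# build_prompt and the message dict format are illustrative, not from inference.py.
def build_prompt(messages):
    """messages: list of {'role': ..., 'content': ...} dicts."""
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Open the assistant header so the model generates the reply next.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(build_prompt([{"role": "user", "content": "Hello"}]))
```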
---
## Model Architecture
**Pipeline:** `Chat Prompt` → `BPE-32K Tokenizer` → `LLaMA Decoder (12L)` → `Token Prediction`
### Decoder Block (×12)
Each transformer layer contains:
- **Grouped Query Attention** with RoPE positional embeddings (12 Q heads, 4 KV heads)
- **SwiGLU MLP** with gated activation (768 → 2048 → 768)
- **RMSNorm** pre-attention and pre-MLP
- **SDPA** backend (Flash Attention when available)
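The 3:1 GQA ratio means each group of three query heads reads from one shared K/V head, shrinking the KV cache to a third of full multi-head attention. A minimal sketch of the head-to-head mapping (the grouping-by-consecutive-heads convention is the usual one, assumed here):

```python
# Sketch: how 12 query heads share 4 KV heads under 3:1 GQA.
N_Q_HEADS, N_KV_HEADS = 12, 4
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 3 query heads per KV head

def kv_head_for(q_head):
    # Consecutive Q heads share a KV head (the common GQA layout).
    return q_head // GROUP_SIZE

print([kv_head_for(q) for q in range(N_Q_HEADS)])
# → [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```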
### Technical Specifications
| Spec | Value |
|------|-------|
| Architecture | LLaMA-style Decoder-Only Transformer |
| Parameters | 100,092,672 (~100M) |
| Hidden Dimension | 768 |
| Intermediate (MLP) | 2,048 |
| Layers | 12 |
| Attention Heads | 12 (Q) / 4 (KV) — GQA 3:1 |
| Head Dimension | 64 |
| Context Length | 1024 |
| RoPE θ | 10,000 |
| Vocabulary | 32,000 (BPE) |
| Tied Embeddings | Yes |
| Precision | bfloat16 |
| Weight Size | ~191 MiB (bf16) |
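The parameter count follows directly from the dimensions above. A quick arithmetic check, assuming no bias terms and tied input/output embeddings (both standard for this architecture):

```python
# Sketch: reproducing the 100,092,672 parameter count from the spec table.
d, d_mlp, layers, vocab = 768, 2048, 12, 32000
n_q, n_kv, head_dim = 12, 4, 64

embed = vocab * d                   # tied with the LM head, so counted once
attn  = d * (n_q * head_dim)        # Q projection
attn += 2 * d * (n_kv * head_dim)   # K and V projections (GQA: 4 heads each)
attn += (n_q * head_dim) * d        # output projection
mlp   = 3 * d * d_mlp               # gate, up, down (SwiGLU)
norms = 2 * d                       # pre-attention + pre-MLP RMSNorm
total = embed + layers * (attn + mlp + norms) + d  # + final RMSNorm

print(total)  # → 100092672
```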
### Design Choices
| Feature | Description | Origin |
|---------|-------------|--------|
| RoPE | Rotary Positional Embeddings for relative position encoding | LLaMA |
| GQA | Grouped Query Attention (3:1) for efficient KV cache | LLaMA-2 |
| SwiGLU | Gated linear unit with SiLU activation | PaLM, LLaMA |
| RMSNorm | Root Mean Square normalization (faster than LayerNorm) | LLaMA |
| Flash Attention | Memory-efficient attention via PyTorch SDPA | Dao et al. |
| Weight Tying | Embedding and LM head share weights | Standard |
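RMSNorm and SwiGLU are simple enough to show in scalar form. A minimal, illustrative sketch (plain Python lists instead of tensors):

```python
import math

def rmsnorm(xs, weight, eps=1e-6):
    # Scale by the root mean square; no mean subtraction, unlike LayerNorm.
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [w * x / rms for w, x in zip(weight, xs)]

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up):
    # Elementwise gated activation: SiLU(gate) * up, as in the MLP above.
    return [silu(g) * u for g, u in zip(gate, up)]

print(rmsnorm([3.0, 4.0], [1.0, 1.0]))
```

In the actual MLP, `gate` and `up` are two separate 768 → 2048 projections of the same input, and the gated product is projected back down 2048 → 768.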
---
## Tokenizer
| Property | Value |
|----------|-------|
| Type | Byte-Pair Encoding (BPE) |
| Vocabulary | 32,000 tokens |
| Library | HuggingFace tokenizers |
### Special Tokens
| Token | ID | Purpose |
|-------|----|---------|
| `<\|pad\|>` | 0 | Padding |
| `<\|unk\|>` | 1 | Unknown |
| `<\|begin_of_text\|>` | 2 | Beginning of text |
| `<\|end_of_text\|>` | 3 | End of text (document boundary) |
| `<\|start_header_id\|>` | 4 | Chat role header open |
| `<\|end_header_id\|>` | 5 | Chat role header close |
| `<\|eot_id\|>` | 6 | End of turn (generation stop token) |
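Combined with the chat template, these IDs form the control-token skeleton of every turn. An illustrative sketch (content tokens elided):

```python
# Sketch: the control-token skeleton of one chat turn, using the ID table above.
SPECIAL = {"<|pad|>": 0, "<|unk|>": 1, "<|begin_of_text|>": 2, "<|end_of_text|>": 3,
           "<|start_header_id|>": 4, "<|end_header_id|>": 5, "<|eot_id|>": 6}

# A user turn followed by an open assistant header; role/content tokens elided:
skeleton = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>",
            "<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]
print([SPECIAL[t] for t in skeleton])  # → [2, 4, 5, 6, 4, 5]

# Generation should stop when the model emits ID 6 (<|eot_id|>).
STOP_ID = SPECIAL["<|eot_id|>"]
```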
---
## Training Details
### Phase 1: Pretraining
| Setting | Value |
|---------|-------|
| Dataset | FineWeb-Edu + Wikipedia |
| Tokens | ~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia) |
| Context Length | 1024 |
| Objective | Next-token prediction (all tokens) |
| Peak LR | 3 × 10⁻⁴ |
| Min LR | 3 × 10⁻⁵ |
| Warmup | 200 steps |
| Schedule | Warmup → Plateau (10%) → Cosine Decay |
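The warmup → plateau → cosine schedule can be sketched as a function of the step index. This is illustrative; the interpretation of the plateau as ~10% of total steps after warmup is an assumption, not confirmed by the repo:

```python
import math

# Sketch of the pretraining LR schedule: linear warmup, a plateau at peak LR
# (assumed here to span ~10% of total steps), then cosine decay to the min LR.
def lr_at(step, total_steps, peak=3e-4, minimum=3e-5, warmup=200, plateau_frac=0.10):
    plateau_end = warmup + int(plateau_frac * total_steps)
    if step < warmup:
        return peak * (step + 1) / warmup          # linear warmup
    if step < plateau_end:
        return peak                                # hold at peak
    # Cosine decay from peak to minimum over the remaining steps.
    progress = (step - plateau_end) / max(1, total_steps - plateau_end)
    return minimum + 0.5 * (peak - minimum) * (1.0 + math.cos(math.pi * progress))

print(lr_at(0, 10_000), lr_at(500, 10_000), lr_at(9_999, 10_000))
```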
### Phase 2: Supervised Fine-Tuning (SFT)
| Setting | Value |
|---------|-------|
| Datasets | Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation |
| Context Length | 1024 |
| Objective | Masked cross-entropy (assistant tokens only) |
| Chat Template | Llama-3 style with header tokens |
| Peak LR | 5 × 10⁻⁵ |
| Min LR | 5 × 10⁻⁶ |
| Warmup | 100 steps |
| Schedule | Warmup → Cosine Decay |
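"Assistant tokens only" means non-assistant positions are masked out of the loss, conventionally by setting their labels to the cross-entropy `ignore_index` of −100. A minimal sketch (the role annotations are illustrative, not the actual training code):

```python
# Sketch: building SFT labels so loss is computed on assistant tokens only.
IGNORE = -100  # conventional ignore_index for cross-entropy

def make_labels(token_ids, roles):
    """roles[i] says who 'owns' token i; only assistant tokens keep a label."""
    return [tid if role == "assistant" else IGNORE
            for tid, role in zip(token_ids, roles)]

ids   = [2, 4, 5, 10, 11, 6, 4, 5, 20, 21, 6]
roles = ["ctx"] * 8 + ["assistant"] * 3   # assistant answer = last 3 tokens
print(make_labels(ids, roles))
# → [-100, -100, -100, -100, -100, -100, -100, -100, 20, 21, 6]
```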
### Shared Training Config
| Setting | Value |
|---------|-------|
| Optimizer | AdamW (fused), β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ |
| Weight Decay | 0.0 |
| Gradient Clipping | 1.0 (global norm) |
| Precision | bfloat16 autocast |
| Compilation | Optional torch.compile (max-autotune) |
| Multi-GPU | Automatic DDP when ≥2 GPUs detected |
---
## Quick Start
### Installation
```bash
pip install torch safetensors tokenizers huggingface_hub
```
### Run
```bash
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
```
The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).
### Usage
**Interactive chat** (default):
```bash
python inference.py
```
**Single prompt**:
```bash
python inference.py --prompt "What is the capital of France?"
```
**Options:**
| Flag | Default | Description |
|------|---------|-------------|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (`cuda` or `cpu`) |
| `--seed` | 1234 | Random seed |
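Nucleus (top-p) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold, then renormalizes. An illustrative sketch over a toy distribution (not the actual `inference.py` implementation):

```python
# Sketch: top-p (nucleus) filtering of a next-token distribution.
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest high-probability set with cumulative mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized distribution

print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

With `top_p=0.9` the lowest-probability token (0.05) is dropped, since the first three already cover 0.95 of the mass; the remaining three are renormalized to sum to 1.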
---
## Model Family
| Model | Parameters | Context | Status |
|-------|------------|---------|--------|
| Monostich | ~100M | 1024 | Available |
| Couplet | ~200M | 1024 | Training |
---
## Limitations
- **Scale**: At 100M parameters, this model is a research prototype, not a production system
---
## File Contents
```
kerzgrr/monostich/
README.md # This model card
inference.py # Standalone inference script
monostich.safetensors # Weights (bfloat16, SafeTensors)
config.json # Model architecture config
tokenizer.json # BPE tokenizer (HuggingFace format)
tokenizer_config.json # Tokenizer metadata
special_token_ids.json # Token ID mapping
special_tokens_map.json # Token string mapping
```
---
## Citation
```bibtex
@misc{monostich2026,
  title={Monostich: A Compact Instruction-Tuned Language Model},
  year={2026},
  url={https://huggingface.co/kerzgrr/monostich}
}
```
---
## Acknowledgments
Built on:
- **LLaMA** architecture (Meta AI)
- **FineWeb-Edu** dataset (HuggingFace)
- **Wikipedia** dataset (Wikimedia)
- **Kyoto-Corpus** (Nikity)
- **LMSYS-Chat-1M** (LMSYS)
- **Nomi-150M-Chat** (guus4324343)
- **Chat-Compilation** (aklein4)
- **PyTorch** SDPA / Flash Attention
- **HuggingFace** tokenizers and hub
---
*A monostich is a poem of a single line — small, but complete.*