---
license: apache-2.0
language:
- en
tags:
- text-generation
- causal-lm
- llama
- transformer
- pytorch
- sft
- instruction-tuned
- flash-attention
- gguf-compatible
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
- Nikity/Kyoto-Corpus
- lmsys/lmsys-chat-1m
- guus4324343/Nomi-150M-Chat
- aklein4/chat-compilation
model-index:
- name: Monostich-100M
results: []
---
<div align="center">
# Monostich 100M
### A Compact Instruction-Tuned Language Model
[![Model](https://img.shields.io/badge/Model-100M_params-blue)](.)
[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org)
[![GGUF](https://img.shields.io/badge/GGUF-Compatible-orange.svg)](https://github.com/ggerganov/llama.cpp)
*A from-scratch LLaMA-style language model pretrained on 16.6B tokens and instruction-tuned on multi-turn chat data*
</div>
---
## Overview
**Monostich** is a ~100M parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.
- **Pretraining**: ~16.6B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) + [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- **SFT**: Multi-turn instruction tuning on four mixed chat datasets with Llama-3-style chat templates
- **Chat template**: Llama-3 style &mdash; `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`
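The template above can be reproduced with a small helper. This is an illustrative sketch; `inference.py` contains the canonical implementation:

```python
def build_prompt(messages):
    """Assemble a Llama-3-style prompt; `messages` is a list of (role, content) pairs."""
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    # Open the assistant turn so the model generates the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(build_prompt([("user", "Hello")]))
```

Generation should stop when the model emits `<|eot_id|>`.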
---
## Model Architecture
**Pipeline:** `Chat Prompt` &rarr; `BPE-32K Tokenizer` &rarr; `LLaMA Decoder (12L)` &rarr; `Token Prediction`
### Decoder Block (&times;12)
Each transformer layer contains:
- **Grouped Query Attention** with RoPE positional embeddings (12 Q heads, 4 KV heads)
- **SwiGLU MLP** with gated activation (768 &rarr; 2048 &rarr; 768)
- **RMSNorm** pre-attention and pre-MLP
- **SDPA** backend (Flash Attention when available)
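As a sketch, the GQA sub-block can be written around PyTorch's SDPA as follows. RoPE application is omitted for brevity, and the projection shapes are inferred from the spec table below, so treat this as illustrative rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def gqa_attention(x, wq, wk, wv, wo, n_q_heads=12, n_kv_heads=4, head_dim=64):
    """Grouped Query Attention: 12 query heads share 4 KV heads (3:1 ratio)."""
    B, T, _ = x.shape
    q = (x @ wq).view(B, T, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    # Expand each KV head to serve its group of 3 query heads.
    k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    # SDPA dispatches to Flash Attention when available.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, n_q_heads * head_dim) @ wo
```

The 4 KV heads are what shrink the KV cache relative to full multi-head attention.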
### Technical Specifications
<table>
<tr><td><b>Architecture</b></td><td>LLaMA-style Decoder-Only Transformer</td></tr>
<tr><td><b>Parameters</b></td><td>100,092,672 (~100M)</td></tr>
<tr><td><b>Hidden Dimension</b></td><td>768</td></tr>
<tr><td><b>Intermediate (MLP)</b></td><td>2,048</td></tr>
<tr><td><b>Layers</b></td><td>12</td></tr>
<tr><td><b>Attention Heads</b></td><td>12 (Q) / 4 (KV) &mdash; GQA 3:1</td></tr>
<tr><td><b>Head Dimension</b></td><td>64</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>RoPE &theta;</b></td><td>10,000</td></tr>
<tr><td><b>Vocabulary</b></td><td>32,000 (BPE)</td></tr>
<tr><td><b>Tied Embeddings</b></td><td>Yes</td></tr>
<tr><td><b>Precision</b></td><td>bfloat16</td></tr>
<tr><td><b>Weight Size</b></td><td>~191 MiB (bf16)</td></tr>
</table>
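The stated parameter count can be recomputed from the table as a consistency check. The breakdown below assumes bias-free projections and a final RMSNorm, which is consistent with the stated total:

```python
# Recompute the ~100M parameter count from the spec table.
d, ffn, layers, vocab = 768, 2048, 12, 32000
n_q, n_kv, head_dim = 12, 4, 64

attn = d * (n_q * head_dim) + 2 * d * (n_kv * head_dim) + (n_q * head_dim) * d
mlp = 3 * d * ffn            # gate, up, and down projections of SwiGLU
norms = 2 * d                # pre-attention + pre-MLP RMSNorm
per_layer = attn + mlp + norms

total = vocab * d + layers * per_layer + d  # tied embeddings + final norm
print(f"{total:,}")  # 100,092,672
```

Weight tying means the 24.6M embedding parameters are counted once, not twice.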
### Design Choices
<table>
<tr><th>Feature</th><th>Description</th><th>Origin</th></tr>
<tr><td><b>RoPE</b></td><td>Rotary Positional Embeddings for relative position encoding</td><td>LLaMA</td></tr>
<tr><td><b>GQA</b></td><td>Grouped Query Attention (3:1) for efficient KV cache</td><td>LLaMA-2</td></tr>
<tr><td><b>SwiGLU</b></td><td>Gated linear unit with SiLU activation</td><td>PaLM, LLaMA</td></tr>
<tr><td><b>RMSNorm</b></td><td>Root Mean Square normalization (faster than LayerNorm)</td><td>LLaMA</td></tr>
<tr><td><b>Flash Attention</b></td><td>Memory-efficient attention via PyTorch SDPA</td><td>Dao et al.</td></tr>
<tr><td><b>Weight Tying</b></td><td>Embedding and LM head share weights</td><td>Standard</td></tr>
</table>
---
## Tokenizer
<table>
<tr><td><b>Type</b></td><td>Byte-Pair Encoding (BPE)</td></tr>
<tr><td><b>Vocabulary</b></td><td>32,000 tokens</td></tr>
<tr><td><b>Library</b></td><td>HuggingFace <code>tokenizers</code></td></tr>
</table>
### Special Tokens
<table>
<tr><th>Token</th><th>ID</th><th>Purpose</th></tr>
<tr><td><code>&lt;|pad|&gt;</code></td><td>0</td><td>Padding</td></tr>
<tr><td><code>&lt;|unk|&gt;</code></td><td>1</td><td>Unknown</td></tr>
<tr><td><code>&lt;|begin_of_text|&gt;</code></td><td>2</td><td>Beginning of text</td></tr>
<tr><td><code>&lt;|end_of_text|&gt;</code></td><td>3</td><td>End of text (document boundary)</td></tr>
<tr><td><code>&lt;|start_header_id|&gt;</code></td><td>4</td><td>Chat role header open</td></tr>
<tr><td><code>&lt;|end_header_id|&gt;</code></td><td>5</td><td>Chat role header close</td></tr>
<tr><td><code>&lt;|eot_id|&gt;</code></td><td>6</td><td>End of turn (generation stop token)</td></tr>
</table>
---
## Training Details
### Phase 1: Pretraining
<table>
<tr><td><b>Dataset</b></td><td><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">FineWeb-Edu</a> + <a href="https://huggingface.co/datasets/wikimedia/wikipedia">Wikipedia</a></td></tr>
<tr><td><b>Tokens</b></td><td>~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia)</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>Objective</b></td><td>Next-token prediction (all tokens)</td></tr>
<tr><td><b>Peak LR</b></td><td>3 &times; 10<sup>-4</sup></td></tr>
<tr><td><b>Min LR</b></td><td>3 &times; 10<sup>-5</sup></td></tr>
<tr><td><b>Warmup</b></td><td>200 steps</td></tr>
<tr><td><b>Schedule</b></td><td>Warmup &rarr; Plateau (10%) &rarr; Cosine Decay</td></tr>
</table>
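The warmup &rarr; plateau &rarr; cosine schedule can be sketched as a step-to-LR function. The exact plateau semantics (10% of total steps held at peak LR) are an assumption consistent with the table:

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup=200, plateau_frac=0.10):
    """Pretraining LR schedule: linear warmup, flat plateau, cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup           # linear warmup
    plateau_end = warmup + int(plateau_frac * total_steps)
    if step < plateau_end:
        return peak_lr                           # hold at peak
    progress = (step - plateau_end) / max(1, total_steps - plateau_end)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The SFT schedule is the same shape with `plateau_frac=0.0`, `peak_lr=5e-5`, `min_lr=5e-6`, and `warmup=100`.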
### Phase 2: Supervised Fine-Tuning (SFT)
<table>
<tr><td><b>Datasets</b></td><td>Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation</td></tr>
<tr><td><b>Context Length</b></td><td>1024</td></tr>
<tr><td><b>Objective</b></td><td>Masked cross-entropy (assistant tokens only)</td></tr>
<tr><td><b>Chat Template</b></td><td>Llama-3 style with header tokens</td></tr>
<tr><td><b>Peak LR</b></td><td>5 &times; 10<sup>-5</sup></td></tr>
<tr><td><b>Min LR</b></td><td>5 &times; 10<sup>-6</sup></td></tr>
<tr><td><b>Warmup</b></td><td>100 steps</td></tr>
<tr><td><b>Schedule</b></td><td>Warmup &rarr; Cosine Decay</td></tr>
</table>
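The "assistant tokens only" objective is typically implemented by masking the labels so the loss ignores prompt tokens; a minimal sketch, where the `-100` sentinel matches the default `ignore_index` of PyTorch's `cross_entropy`:

```python
IGNORE_INDEX = -100  # default ignore_index for torch.nn.functional.cross_entropy

def mask_labels(token_ids, assistant_mask):
    """Replace non-assistant positions with IGNORE_INDEX so they carry no loss.

    The usual one-token shift between inputs and labels is handled elsewhere.
    """
    return [tok if is_asst else IGNORE_INDEX
            for tok, is_asst in zip(token_ids, assistant_mask)]

print(mask_labels([4, 9, 5, 120, 6], [False, False, True, True, True]))
```

User turns, headers, and padding all fall under the mask; only assistant completions contribute gradient.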
### Shared Training Config
<table>
<tr><td><b>Optimizer</b></td><td>AdamW (fused) &mdash; &beta;<sub>1</sub>=0.9, &beta;<sub>2</sub>=0.95, &epsilon;=10<sup>-8</sup></td></tr>
<tr><td><b>Weight Decay</b></td><td>0.0</td></tr>
<tr><td><b>Gradient Clipping</b></td><td>1.0 (global norm)</td></tr>
<tr><td><b>Precision</b></td><td>bfloat16 autocast</td></tr>
<tr><td><b>Compilation</b></td><td>Optional <code>torch.compile</code> (max-autotune)</td></tr>
<tr><td><b>Multi-GPU</b></td><td>Automatic DDP when &ge;2 GPUs detected</td></tr>
</table>
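The shared optimizer setup can be reconstructed on a toy module as follows. `fused=True` (as used in training) is CUDA-only, so it is gated here, and the toy model and loss are placeholders:

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the real model
opt = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                       # peak LR for pretraining (5e-5 for SFT)
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.0,
    fused=torch.cuda.is_available(),
)

x = torch.randn(4, 768)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
with torch.autocast(device_type, dtype=torch.bfloat16):  # bf16 autocast
    loss = model(x).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global-norm clipping
opt.step()
opt.zero_grad()
```

With &beta;<sub>2</sub>=0.95 and zero weight decay, this mirrors common LLaMA-style pretraining recipes.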
### SFT Datasets
<table>
<tr><th>Dataset</th><th>Source</th><th>Notes</th></tr>
<tr><td><b>Kyoto-Corpus</b></td><td><a href="https://huggingface.co/datasets/Nikity/Kyoto-Corpus">Nikity/Kyoto-Corpus</a></td><td>Multi-turn instruction pairs</td></tr>
<tr><td><b>LMSYS-Chat-1M</b></td><td><a href="https://huggingface.co/datasets/lmsys/lmsys-chat-1m">lmsys/lmsys-chat-1m</a></td><td>Real-world conversations (redacted rows skipped)</td></tr>
<tr><td><b>Nomi-150M-Chat</b></td><td><a href="https://huggingface.co/datasets/guus4324343/Nomi-150M-Chat">guus4324343/Nomi-150M-Chat</a></td><td>Synthetic chat data</td></tr>
<tr><td><b>Chat-Compilation</b></td><td><a href="https://huggingface.co/datasets/aklein4/chat-compilation">aklein4/chat-compilation</a></td><td>Multi-source compilation (system-prompt conversations excluded)</td></tr>
</table>
---
## Quick Start
### Installation
```bash
pip install torch safetensors tokenizers huggingface_hub
```
### Run
```bash
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
```
The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).
### Usage
**Interactive chat** (default):
```bash
python inference.py
```
**Single prompt**:
```bash
python inference.py --prompt "What is the capital of France?"
```
**Options:**
| Flag | Default | Description |
|------|---------|-------------|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (`cuda` or `cpu`) |
| `--seed` | 1234 | Random seed |
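The `--temperature` and `--top-p` flags correspond to temperature scaling followed by nucleus sampling. A plain-Python sketch of that decoding step (not the code in `inference.py`):

```python
import math
import random

def sample_top_p(logits, temperature=0.28, top_p=0.95, rng=random):
    """Temperature scaling, then sample from the smallest set with mass >= top_p."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for p, i in ranked:                        # accumulate the nucleus
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    r = rng.random() * cum                     # renormalize over the kept set
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

The low default temperature (0.28) biases a model this small toward its most confident continuations.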
---
## Model Family
<table>
<tr><th>Model</th><th>Parameters</th><th>Context</th><th>Status</th></tr>
<tr><td><b>Monostich</b></td><td>~100M</td><td>1024</td><td>Available</td></tr>
<tr><td><b>Couplet</b></td><td>~200M</td><td>1024</td><td>Training</td></tr>
</table>
---
## Limitations
- **Scale**: At ~100M parameters, this model is a research prototype, not a production system; expect factual errors and shallow reasoning, especially outside its training distribution
---
## File Contents
```
kerzgrr/monostich/
README.md # This model card
inference.py # Standalone inference script
monostich.safetensors # Weights (bfloat16, SafeTensors)
config.json # Model architecture config
tokenizer.json # BPE tokenizer (HuggingFace format)
tokenizer_config.json # Tokenizer metadata
special_token_ids.json # Token ID mapping
special_tokens_map.json # Token string mapping
```
---
## Citation
```bibtex
@misc{monostich2026,
title={Monostich: A Compact Instruction-Tuned Language Model},
year={2026},
url={https://huggingface.co/kerzgrr/monostich}
}
```
---
## Acknowledgments
Built on:
- **LLaMA** architecture (Meta AI)
- **FineWeb-Edu** dataset (HuggingFace)
- **Wikipedia** dataset (Wikimedia)
- **Kyoto-Corpus** (Nikity)
- **LMSYS-Chat-1M** (LMSYS)
- **Nomi-150M-Chat** (guus4324343)
- **Chat-Compilation** (aklein4)
- **PyTorch** SDPA / Flash Attention
- **HuggingFace** tokenizers and hub
---
<div align="center">
*A monostich is a poem of a single line &mdash; small, but complete.*
</div>