---
license: apache-2.0
language:
- en
tags:
- text-generation
- causal-lm
- llama
- transformer
- pytorch
- sft
- instruction-tuned
- flash-attention
- gguf-compatible
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
- Nikity/Kyoto-Corpus
- lmsys/lmsys-chat-1m
- guus4324343/Nomi-150M-Chat
- aklein4/chat-compilation
model-index:
- name: Monostich-100M
  results: []
---
# Monostich 100M

### A Compact Instruction-Tuned Language Model

[![Model](https://img.shields.io/badge/Model-100M_params-blue)](.)
[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org)
[![GGUF](https://img.shields.io/badge/GGUF-Compatible-orange.svg)](https://github.com/ggerganov/llama.cpp)

*A from-scratch LLaMA-style language model pretrained on ~16.6B tokens and instruction-tuned on multi-turn chat data*
---

## Overview

**Monostich** is a ~100M-parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.

- **Pretraining**: ~16.6B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) + [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- **SFT**: Multi-turn instruction tuning on a mix of chat datasets with a Llama-3-style chat template
- **Chat template**: Llama-3 style, e.g. `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`

---

## Model Architecture

**Pipeline:** `Chat Prompt` → `BPE-32K Tokenizer` → `LLaMA Decoder (12L)` → `Token Prediction`

### Decoder Block (×12)

Each transformer layer contains:

- **Grouped Query Attention** with RoPE positional embeddings (12 Q heads, 4 KV heads)
- **SwiGLU MLP** with gated activation (768 → 2048 → 768)
- **RMSNorm** pre-attention and pre-MLP
- **SDPA** backend (Flash Attention when available)

### Technical Specifications
| Specification | Value |
|---|---|
| Architecture | LLaMA-style decoder-only transformer |
| Parameters | 100,092,672 (~100M) |
| Hidden Dimension | 768 |
| Intermediate (MLP) | 2,048 |
| Layers | 12 |
| Attention Heads | 12 (Q) / 4 (KV), GQA 3:1 |
| Head Dimension | 64 |
| Context Length | 1024 |
| RoPE θ | 10,000 |
| Vocabulary | 32,000 (BPE) |
| Tied Embeddings | Yes |
| Precision | bfloat16 |
| Weight Size | ~191 MiB (bf16) |
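The parameter count in the table can be reproduced from the listed dimensions. A quick sanity check, assuming standard LLaMA layer shapes (no biases, input embedding shared with the LM head):

```python
# Parameter-count sanity check for the dimensions listed above.
# Assumes standard LLaMA shapes: no biases, tied input/output embeddings.
vocab, d, d_ff, layers = 32_000, 768, 2_048, 12
q_heads, kv_heads, head_dim = 12, 4, 64

embed = vocab * d                       # token embedding (shared with LM head)
attn  = d * (q_heads * head_dim)        # Q projection
attn += 2 * d * (kv_heads * head_dim)   # K and V projections (GQA: 4 heads each)
attn += (q_heads * head_dim) * d        # output projection
mlp   = 3 * d * d_ff                    # SwiGLU: gate, up, and down projections
norms = 2 * d                           # two RMSNorm weight vectors per layer

total = embed + layers * (attn + mlp + norms) + d  # + final RMSNorm
print(total)              # 100,092,672 -- matches the table exactly
print(total * 2 / 2**20)  # ~190.9 MiB at 2 bytes/param (bf16)
```

The last line also recovers the ~191 MiB weight size quoted above.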
### Design Choices
| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary positional embeddings for relative position encoding | LLaMA |
| GQA | Grouped Query Attention (3:1) for a smaller KV cache | LLaMA-2 |
| SwiGLU | Gated linear unit with SiLU activation | PaLM, LLaMA |
| RMSNorm | Root-mean-square normalization (faster than LayerNorm) | LLaMA |
| Flash Attention | Memory-efficient attention via PyTorch SDPA | Dao et al. |
| Weight Tying | Embedding and LM head share weights | Standard |
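The GQA row translates directly into KV-cache savings. A back-of-envelope sketch, assuming the usual cache layout (K and V tensors per layer, bf16 at 2 bytes per element):

```python
# KV-cache size at full context in bf16 (2 bytes per element).
layers, head_dim, ctx, bytes_per_elem = 12, 64, 1024, 2

def kv_cache_bytes(kv_heads):
    # one K tensor and one V tensor for every layer
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

gqa = kv_cache_bytes(4)    # 4 KV heads, as shipped
mha = kv_cache_bytes(12)   # hypothetical full multi-head attention
print(gqa / 2**20, mha / 2**20)  # 12.0 MiB vs 36.0 MiB: 3x smaller cache
```

At the 1024-token context this is a modest absolute saving, but the 3:1 ratio carries over unchanged to batched serving, where the cache is replicated per sequence.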
---

## Tokenizer
| Property | Value |
|---|---|
| Type | Byte-Pair Encoding (BPE) |
| Vocabulary | 32,000 tokens |
| Library | HuggingFace `tokenizers` |
### Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<\|pad\|>` | 0 | Padding |
| `<\|unk\|>` | 1 | Unknown |
| `<\|begin_of_text\|>` | 2 | Beginning of text |
| `<\|end_of_text\|>` | 3 | End of text (document boundary) |
| `<\|start_header_id\|>` | 4 | Chat role header open |
| `<\|end_header_id\|>` | 5 | Chat role header close |
| `<\|eot_id\|>` | 6 | End of turn (generation stop token) |
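These special tokens compose the Llama-3-style prompt shown in the overview. A minimal sketch of the assembly in plain Python (the shipped tokenizer applies this via its chat template; `build_prompt` is a hypothetical helper for illustration):

```python
def build_prompt(turns):
    """Assemble a Llama-3-style prompt from (role, text) turns.
    The trailing empty assistant header cues the model to respond."""
    prompt = "<|begin_of_text|>"
    for role, text in turns:
        prompt += f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>"
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(build_prompt([("user", "Hello")]))
```

Generation then stops when the model emits `<|eot_id|>` (ID 6).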
---

## Training Details

### Phase 1: Pretraining
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu + Wikipedia |
| Tokens | ~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia) |
| Context Length | 1024 |
| Objective | Next-token prediction (all tokens) |
| Peak LR | 3 × 10⁻⁴ |
| Min LR | 3 × 10⁻⁵ |
| Warmup | 200 steps |
| Schedule | Warmup → Plateau (10%) → Cosine Decay |
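The warmup → plateau → cosine schedule can be sketched as a plain function of the step count. This is a hypothetical reconstruction: it assumes the 10% plateau is 10% of total steps, held at peak LR immediately after warmup; the actual trainer may place it differently.

```python
import math

def lr_at(step, total_steps, peak=3e-4, floor=3e-5, warmup=200, plateau_frac=0.10):
    """Warmup -> plateau (plateau_frac of total steps, at peak) -> cosine decay.
    Sketch only; exact plateau placement in the training code may differ."""
    plateau = int(plateau_frac * total_steps)
    if step < warmup:                        # linear warmup from 0 to peak
        return peak * step / warmup
    if step < warmup + plateau:              # hold at peak LR
        return peak
    # cosine decay from peak down to the floor over the remaining steps
    t = (step - warmup - plateau) / max(1, total_steps - warmup - plateau)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * min(t, 1.0)))

print(lr_at(200, 10_000), lr_at(10_000, 10_000))  # peak, then floor
```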
### Phase 2: Supervised Fine-Tuning (SFT)
| Setting | Value |
|---|---|
| Datasets | Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation |
| Context Length | 1024 |
| Objective | Masked cross-entropy (loss on assistant tokens only) |
| Chat Template | Llama-3 style with header tokens |
| Peak LR | 5 × 10⁻⁵ |
| Min LR | 5 × 10⁻⁶ |
| Warmup | 100 steps |
| Schedule | Warmup → Cosine Decay |
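"Assistant tokens only" means non-assistant positions are excluded from the loss, conventionally by setting their label to the ignore index -100 (the default `ignore_index` of PyTorch's cross-entropy loss). A minimal sketch with a toy token sequence; `build_labels` and the span format are illustrative, not the repo's actual code:

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def build_labels(token_ids, assistant_spans):
    """Copy token_ids, masking everything outside the assistant spans.
    assistant_spans: list of (start, end) half-open index ranges."""
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# Toy example: 8 tokens where the assistant reply occupies positions 5..8
ids = [2, 4, 10, 5, 11, 42, 43, 6]
print(build_labels(ids, [(5, 8)]))
```

Only the reply tokens (including the closing `<|eot_id|>`) are supervised; prompt and header tokens still provide context through attention but receive no gradient.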
### Shared Training Config
| Setting | Value |
|---|---|
| Optimizer | AdamW (fused), β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ |
| Weight Decay | 0.0 |
| Gradient Clipping | 1.0 (global norm) |
| Precision | bfloat16 autocast |
| Compilation | Optional `torch.compile` (max-autotune) |
| Multi-GPU | Automatic DDP when ≥2 GPUs detected |
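Global-norm clipping rescales all gradients together when their combined L2 norm exceeds the threshold, which is what PyTorch's `clip_grad_norm_` does across all parameter tensors. A pure-Python sketch of the rule on a flat list of values:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradient values so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads                      # already within bounds: untouched
    scale = max_norm / total_norm         # shrink every component uniformly
    return [g * scale for g in grads]

print(clip_by_global_norm([3.0, 4.0]))  # norm 5 -> rescaled to norm 1
```

Because the scale factor is shared, the gradient's direction is preserved; only its magnitude is capped.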
### SFT Datasets
| Dataset | Source | Notes |
|---|---|---|
| Kyoto-Corpus | Nikity/Kyoto-Corpus | Multi-turn instruction pairs |
| LMSYS-Chat-1M | lmsys/lmsys-chat-1m | Real-world conversations (redacted rows skipped) |
| Nomi-150M-Chat | guus4324343/Nomi-150M-Chat | Synthetic chat data |
| Chat-Compilation | aklein4/chat-compilation | Multi-source compilation (system-prompt conversations excluded) |
---

## Quick Start

### Installation

```bash
pip install torch safetensors tokenizers huggingface_hub
```

### Run

```bash
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
```

The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after the first run).

### Usage

**Interactive chat** (default):

```bash
python inference.py
```

**Single prompt**:

```bash
python inference.py --prompt "What is the capital of France?"
```

**Options:**

| Flag | Default | Description |
|------|---------|-------------|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (`cuda` or `cpu`) |
| `--seed` | 1234 | Random seed |

---

## Model Family
| Model | Parameters | Context | Status |
|---|---|---|---|
| Monostich | ~100M | 1024 | Available |
| Couplet | ~200M | 1024 | Training |
---

## Limitations

- **Scale**: At 100M parameters this model is a research prototype, not a production system.

---

## File Contents

```
kerzgrr/monostich/
  README.md               # This model card
  inference.py            # Standalone inference script
  monostich.safetensors   # Weights (bfloat16, SafeTensors)
  config.json             # Model architecture config
  tokenizer.json          # BPE tokenizer (HuggingFace format)
  tokenizer_config.json   # Tokenizer metadata
  special_token_ids.json  # Token ID mapping
  special_tokens_map.json # Token string mapping
```

---

## Citation

```bibtex
@misc{monostich2026,
  title={Monostich: A Compact Instruction-Tuned Language Model},
  year={2026},
  url={https://huggingface.co/kerzgrr/monostich}
}
```

---

## Acknowledgments

Built on:

- **LLaMA** architecture (Meta AI)
- **FineWeb-Edu** dataset (HuggingFace)
- **Wikipedia** dataset (Wikimedia)
- **Kyoto-Corpus** (Nikity)
- **LMSYS-Chat-1M** (LMSYS)
- **Nomi-150M-Chat** (guus4324343)
- **Chat-Compilation** (aklein4)
- **PyTorch** SDPA / Flash Attention
- **HuggingFace** tokenizers and hub

---
*A monostich is a poem of a single line — small, but complete.*