Monostich 100M

A Compact Instruction-Tuned Language Model


A from-scratch LLaMA-style language model, pretrained on ~16.6B tokens and instruction-tuned on multi-turn chat data.


Overview

Monostich is a ~100M parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight.

  • Pretraining: ~16.6B tokens from FineWeb-Edu + Wikipedia
  • SFT: Multi-turn instruction tuning on 5 mixed datasets with Llama-3-style chat templates
  • Chat template: Llama-3 style: `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`
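The template above can be assembled with a small helper. This is a minimal sketch of the format, not code from the released script (`format_chat` is an illustrative name):

```python
# Minimal sketch of the Llama-3-style chat template used by this model.
# `format_chat` is an illustrative helper, not part of the released script.

def format_chat(messages):
    """Render a list of {"role", "content"} dicts into a prompt string."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        out += m["content"] + "<|eot_id|>"
    # Open the assistant turn so the model continues from here.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = format_chat([{"role": "user", "content": "Hello"}])
```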

Model Architecture

Pipeline: Chat Prompt → BPE-32K Tokenizer → LLaMA Decoder (12L) → Token Prediction

Decoder Block (×12)

Each transformer layer contains:

  • Grouped Query Attention with RoPE positional embeddings (12 Q heads, 4 KV heads)
  • SwiGLU MLP with gated activation (768 → 2048 → 768)
  • RMSNorm pre-attention and pre-MLP
  • SDPA backend (Flash Attention when available)
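With 12 query heads and 4 KV heads, each KV head serves a group of 3 query heads. The index arithmetic behind that grouping can be sketched in plain Python (no framework code, just the mapping):

```python
# GQA 3:1 mapping: query head q reads the K/V of KV head q // group_size.
n_q_heads, n_kv_heads = 12, 4
group_size = n_q_heads // n_kv_heads  # 3 query heads per KV head

kv_head_for_q = [q // group_size for q in range(n_q_heads)]
```

This is why the KV cache is 3× smaller than with full multi-head attention: only 4 K/V head states are stored per layer instead of 12.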

Technical Specifications

| Specification | Value |
|---|---|
| Architecture | LLaMA-style decoder-only transformer |
| Parameters | 100,092,672 (~100M) |
| Hidden dimension | 768 |
| Intermediate (MLP) | 2,048 |
| Layers | 12 |
| Attention heads | 12 (Q) / 4 (KV), GQA 3:1 |
| Head dimension | 64 |
| Context length | 1024 |
| RoPE θ | 10,000 |
| Vocabulary | 32,000 (BPE) |
| Tied embeddings | Yes |
| Precision | bfloat16 |
| Weight size | ~191 MiB (bf16) |
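The headline parameter count follows directly from these specs. A quick arithmetic check, assuming the standard LLaMA layer layout (Q/K/V/O projections, three SwiGLU matrices, two RMSNorm vectors per layer, a final norm, and an embedding tied with the LM head):

```python
vocab, d, d_ff, layers = 32_000, 768, 2_048, 12
n_q, n_kv, d_head = 12, 4, 64

embed = vocab * d                                  # tied with the LM head
attn = d * (n_q * d_head) + 2 * d * (n_kv * d_head) + (n_q * d_head) * d
mlp = 3 * d * d_ff                                 # gate, up, down
norms = 2 * d                                      # pre-attn + pre-MLP RMSNorm

total = embed + layers * (attn + mlp + norms) + d  # + final RMSNorm
mib = total * 2 / 2**20                            # bf16 = 2 bytes per param
```

This reproduces both table entries: 100,092,672 parameters and ~191 MiB of bf16 weights.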

Design Choices

| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary positional embeddings for relative position encoding | LLaMA |
| GQA | Grouped query attention (3:1) for an efficient KV cache | LLaMA-2 |
| SwiGLU | Gated linear unit with SiLU activation | PaLM, LLaMA |
| RMSNorm | Root-mean-square normalization (faster than LayerNorm) | LLaMA |
| Flash Attention | Memory-efficient attention via PyTorch SDPA | Dao et al. |
| Weight tying | Embedding and LM head share weights | Standard |

Tokenizer

| Property | Value |
|---|---|
| Type | Byte-Pair Encoding (BPE) |
| Vocabulary | 32,000 tokens |
| Library | HuggingFace tokenizers |

Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|pad\|>` | 0 | Padding |
| `<\|unk\|>` | 1 | Unknown |
| `<\|begin_of_text\|>` | 2 | Beginning of text |
| `<\|end_of_text\|>` | 3 | End of text (document boundary) |
| `<\|start_header_id\|>` | 4 | Chat role header open |
| `<\|end_header_id\|>` | 5 | Chat role header close |
| `<\|eot_id\|>` | 6 | End of turn (generation stop token) |
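Generation stops when the model emits `<|eot_id|>` (ID 6). A minimal sketch of that stopping loop, with a stand-in `next_token` function in place of the real model forward pass:

```python
EOT_ID = 6  # <|eot_id|>, the end-of-turn stop token

def generate(next_token, prompt_ids, max_new_tokens=256):
    """Append sampled tokens until <|eot_id|> or the budget is hit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        if tok == EOT_ID:  # end of assistant turn: stop, don't append
            break
        ids.append(tok)
    return ids

# Stand-in "model" that emits tokens 10, 11, then the stop token.
script = iter([10, 11, EOT_ID])
out = generate(lambda ids: next(script), [2, 4])
```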

Training Details

Phase 1: Pretraining

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu + Wikipedia |
| Tokens | ~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia) |
| Context length | 1024 |
| Objective | Next-token prediction (all tokens) |
| Peak LR | 3 × 10⁻⁴ |
| Min LR | 3 × 10⁻⁵ |
| Warmup | 200 steps |
| Schedule | Warmup → plateau (10%) → cosine decay |
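The schedule (linear warmup, a flat plateau for 10% of steps, then cosine decay to the minimum LR) can be sketched as a step-to-LR function. The exact plateau placement here is an assumption; the released training code may differ in detail:

```python
import math

def lr_at(step, total_steps, peak=3e-4, floor=3e-5,
          warmup=200, plateau_frac=0.10):
    """Warmup -> plateau (10% of steps) -> cosine decay to `floor`."""
    plateau_end = warmup + int(plateau_frac * total_steps)
    if step < warmup:                  # linear warmup to peak
        return peak * (step + 1) / warmup
    if step < plateau_end:             # hold at peak
        return peak
    # cosine decay from peak down to floor over the remaining steps
    t = (step - plateau_end) / max(1, total_steps - plateau_end)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```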

Phase 2: Supervised Fine-Tuning (SFT)

| Setting | Value |
|---|---|
| Datasets | Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation |
| Context length | 1024 |
| Objective | Masked cross-entropy (assistant tokens only) |
| Chat template | Llama-3 style with header tokens |
| Peak LR | 5 × 10⁻⁵ |
| Min LR | 5 × 10⁻⁶ |
| Warmup | 100 steps |
| Schedule | Warmup → cosine decay |
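"Masked cross-entropy (assistant tokens only)" means labels for prompt and user tokens are set to the ignore index (-100, the PyTorch `cross_entropy` convention) so they contribute no loss. A framework-free sketch of that label masking:

```python
IGNORE_INDEX = -100  # PyTorch's default cross_entropy ignore index

def mask_labels(token_ids, assistant_mask):
    """Keep targets for assistant tokens; ignore everything else."""
    return [tok if is_asst else IGNORE_INDEX
            for tok, is_asst in zip(token_ids, assistant_mask)]

# User turn (ids 10..12) is masked out; assistant reply (20..22) is kept.
labels = mask_labels([10, 11, 12, 20, 21, 22],
                     [False, False, False, True, True, True])
```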

Shared Training Config

| Setting | Value |
|---|---|
| Optimizer | AdamW (fused), β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸ |
| Weight decay | 0.0 |
| Gradient clipping | 1.0 (global norm) |
| Precision | bfloat16 autocast |
| Compilation | Optional `torch.compile` (max-autotune) |
| Multi-GPU | Automatic DDP when ≥2 GPUs detected |

SFT Datasets

| Dataset | Source | Notes |
|---|---|---|
| Kyoto-Corpus | Nikity/Kyoto-Corpus | Multi-turn instruction pairs |
| LMSYS-Chat-1M | lmsys/lmsys-chat-1m | Real-world conversations (redacted rows skipped) |
| Nomi-150M-Chat | guus4324343/Nomi-150M-Chat | Synthetic chat data |
| Chat-Compilation | aklein4/chat-compilation | Multi-source compilation (system-prompt conversations excluded) |

Quick Start

Installation

```
pip install torch safetensors tokenizers huggingface_hub
```

Run

```
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
```

The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).

Usage

Interactive chat (default):

```
python inference.py
```

Single prompt:

```
python inference.py --prompt "What is the capital of France?"
```

Options:

| Flag | Default | Description |
|---|---|---|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (`cuda` or `cpu`) |
| `--seed` | 1234 | Random seed |
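Nucleus (top-p) sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches `--top-p`, then renormalizes and samples within that set. A framework-free sketch of the truncation step (the real script operates on model logits, not a probability list):

```python
def top_p_filter(probs, top_p=0.95):
    """Zero out tokens outside the smallest nucleus covering top_p mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:          # add tokens from most to least probable
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:     # nucleus reached: drop the rest
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0
            for i in range(len(probs))]

# Token 3 (p=0.02) falls outside the 0.95 nucleus and is removed.
filtered = top_p_filter([0.6, 0.3, 0.08, 0.02], top_p=0.95)
```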

Model Family

| Model | Parameters | Context | Status |
|---|---|---|---|
| Monostich | ~100M | 1024 | Available |
| Couplet | ~200M | 1024 | Training |

Limitations

  • Scale: At 100M parameters this model is a research prototype, not a production system

File Contents

```
kerzgrr/monostich/
  README.md                # This model card
  inference.py             # Standalone inference script
  monostich.safetensors    # Weights (bfloat16, SafeTensors)
  config.json              # Model architecture config
  tokenizer.json           # BPE tokenizer (HuggingFace format)
  tokenizer_config.json    # Tokenizer metadata
  special_token_ids.json   # Token ID mapping
  special_tokens_map.json  # Token string mapping
```

Citation

```bibtex
@misc{monostich2026,
  title={Monostich: A Compact Instruction-Tuned Language Model},
  year={2026},
  url={https://huggingface.co/kerzgrr/monostich}
}
```

Acknowledgments

Built on:

  • LLaMA architecture (Meta AI)
  • FineWeb-Edu dataset (HuggingFace)
  • Wikipedia dataset (Wikimedia)
  • Kyoto-Corpus (Nikity)
  • LMSYS-Chat-1M (LMSYS)
  • Nomi-150M-Chat (guus4324343)
  • Chat-Compilation (aklein4)
  • PyTorch SDPA / Flash Attention
  • HuggingFace tokenizers and hub

A monostich is a poem of a single line — small, but complete.
