---
license: apache-2.0
language:
- en
tags:
- language-model
- transformer
- rope
- swiglu
- gqa
- muon
- from-scratch
- tiny
- small
- decoder-only
datasets:
- epfml/FineWeb-HQ
- HuggingFaceTB/cosmopedia
- HuggingFaceTB/finemath
- bigcode/python-stack-v1-functions-filtered
- wikimedia/wikipedia
pipeline_tag: text-generation
---

# İvme-Conversate-22M-Base

![Conversate-22M Logo](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/Gi8oMz-Q8n2CImbtVyHOy.png)

**İvme** (Turkish: *acceleration*) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.

The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision (architecture, data mix, optimizer, training schedule) is made deliberately.

---

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 22,028,160 |
| Layers | 10 |
| Hidden dim | 384 |
| FFN dim | 1024 (SwiGLU) |
| Attention heads | 6 query / 2 KV (GQA) |
| Context length | 1024 tokens |
| Vocab size | 16,384 (custom BPE) |
| Positional encoding | RoPE (θ=10,000) |
| Normalization | RMSNorm (pre-norm) |
| Embeddings | Tied input/output |
| Biases | None |

---

## Benchmarks

All benchmarks run via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), 0-shot. WikiText-2 uses `byte_perplexity` for tokenizer-independent comparison.

| Benchmark | Score | Notes |
|---|---|---|
| WikiText-2 (byte_perplexity) ↓ | **2.96** | Lower is better |
| BLiMP ↑ | **61.40%** | Average over 67 subtasks; random baseline 50% |
| ARC-Easy ↑ | **30.85%** | acc_norm, 0-shot |

---

## Training

### Data Mix (~1.57B tokens, Chinchilla-optimal)

Data is ordered in ascending quality for curriculum learning: the model sees noisier web text first and the densest material last.

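
To make the curriculum ordering concrete, the sketch below turns the approximate token counts from the data-mix table in this section into stage boundaries, expressed as fractions of training. It rests on two assumptions that the card does not confirm: that the row order of the table is the curriculum order, and that sources are read strictly one after another with no interleaving. The names `DATA_MIX`, `curriculum_boundaries`, and `source_at` are hypothetical and exist only for this illustration.

```python
from itertools import accumulate

# Approximate token counts (in millions), copied from the data-mix table.
# ASSUMPTION: table row order == curriculum order, with no interleaving.
DATA_MIX = [
    ("epfml/FineWeb-HQ", 710),
    ("bigcode/python-stack-v1-functions-filtered", 160),
    ("HuggingFaceTB/finemath", 235),
    ("HuggingFaceTB/cosmopedia", 395),
    ("wikimedia/wikipedia", 80),
]


def curriculum_boundaries(mix):
    """Return (source, start_fraction, end_fraction) for each curriculum stage."""
    total = sum(tokens for _, tokens in mix)
    ends = list(accumulate(tokens for _, tokens in mix))
    starts = [0] + ends[:-1]
    return [
        (name, start / total, end / total)
        for (name, _), start, end in zip(mix, starts, ends)
    ]


def source_at(progress, mix=DATA_MIX):
    """Name the source being read at `progress`, a fraction in [0, 1) of training."""
    for name, start, end in curriculum_boundaries(mix):
        if start <= progress < end:
            return name
    raise ValueError("progress must be in [0, 1)")


if __name__ == "__main__":
    for name, start, end in curriculum_boundaries(DATA_MIX):
        print(f"{start:6.1%} to {end:6.1%}  {name}")
```

Under these assumptions, roughly the first 45% of training reads web text and only the final 5% reads Wikipedia, which is one way to picture "densest material last".
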

| Source | Tokens | Share |
|---|---|---|
| epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
| bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
| HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
| HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
| wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |

### Hyperparameters

| Setting | Value |
|---|---|
| Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
| Muon lr | 0.02 |
| AdamW lr | 3e-4 |
| LR schedule | Warmup-Stable-Decay (WSD) |
| Warmup steps | 100 |
| Decay fraction | 20% of training |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Effective batch | ~1.05M tokens/step |
| Total steps | 1,447 |
| Precision | bfloat16 |
| Attention | Flash Attention 2 (HF Kernels) |
| Final weights | EMA (β=0.999) of training trajectory |

### Hardware

Trained on a single NVIDIA RTX PRO 6000 Blackwell (96GB) in approximately **20 minutes**.

---

## Tokenizer

Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.

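
For readers unfamiliar with byte-level BPE, here is a toy, standard-library-only sketch of the idea: start from the 256 possible byte values and repeatedly merge the most frequent adjacent pair into a new token id. This is not the code used to train this tokenizer. It leaves out special tokens and the word-like chunking that ByteLevel pre-tokenizers typically apply before merging, and the function names (`train_byte_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_byte_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base alphabet: byte values 0..255
    merges = {}  # (left_id, right_id) -> new_id, kept in learned order
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each id into its bytes."""
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    merges = train_byte_bpe("low lower lowest " * 20, num_merges=10)
    ids = encode("lowest lower İvme", merges)
    print(ids)
    print(decode(ids, merges))
```

Because the base alphabet covers every byte, any UTF-8 string (including `İvme`) can be encoded without an out-of-vocabulary fallback; the learned merges only make frequent sequences shorter.
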

Special tokens: `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>`, `<|user|>`, `<|assistant|>`, `<|system|>`

---

## Usage

```python
import torch
from tokenizers import Tokenizer

# Load with custom code (not a standard HF AutoModel; see model.py)
from model import IvmeConfig, IvmeConversate

tokenizer = Tokenizer.from_file("ivme_tokenizer.json")

ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False)
cfg = ckpt["cfg"]
cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn

model = IvmeConversate(cfg).cuda()
model.load_state_dict(ckpt["model"])
model.eval()

prompt = "The theory of relativity states that"
ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")
out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0].tolist()))
```

---

## Limitations

- Base model only: not instruction tuned, will not follow instructions or answer questions
- English only (v1)
- Limited factual knowledge due to Chinchilla-optimal training (1.57B tokens)
- Repetition at higher temperatures without `repetition_penalty`
- 1024 token context window

---

## What's Next

- **İvme-Conversate-22M-Instruct**: SFT on smol-smoltalk for instruction following
- **İvme-Conversate-v2**: extended training (~15B tokens), reordered curriculum
- **Turkish support**: v2 will add EN+TR with a dedicated bilingual tokenizer
- **İvme-Classify**: encoder-only series for classification tasks

---

## Citation

```bibtex
@misc{ivme-conversate-22m,
  author    = {IvmeLabs},
  title     = {İvme-Conversate-22M-Base},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
}
```

---

*Built by IvmeLabs. Small models, deliberate choices.*