| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - language-model |
| - transformer |
| - rope |
| - swiglu |
| - gqa |
| - muon |
| - from-scratch |
| - tiny |
| - small |
| - decoder-only |
| datasets: |
| - epfml/FineWeb-HQ |
| - HuggingFaceTB/cosmopedia |
| - HuggingFaceTB/finemath |
| - bigcode/python-stack-v1-functions-filtered |
| - wikimedia/wikipedia |
| pipeline_tag: text-generation |
| --- |
| |
| # İvme-Conversate-22M-Base |
|
|
|  |
|
|
| **İvme** (Turkish: *acceleration*) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter, decoder-only base model trained from scratch on a dense, quality-filtered corpus. |
|
|
| The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision — architecture, data mix, optimizer, training schedule — is made deliberately. |
|
|
| --- |
|
|
| ## Model Details |
|
|
| | Parameter | Value | |
| |---|---| |
| | Architecture | Decoder-only transformer | |
| | Parameters | 22,028,160 | |
| | Layers | 10 | |
| | Hidden dim | 384 | |
| | FFN dim | 1024 (SwiGLU) | |
| | Attention heads | 6 query / 2 KV (GQA) | |
| | Context length | 1024 tokens | |
| | Vocab size | 16,384 (custom BPE) | |
| | Positional encoding | RoPE (θ=10,000) | |
| | Normalization | RMSNorm (pre-norm) | |
| | Embeddings | Tied input/output | |
| | Biases | None | |
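
The parameter total can be reproduced from the table itself. The sketch below assumes a head dimension of 384 / 6 = 64, two RMSNorm gains per block, and one final RMSNorm before the tied output head; these details are not stated in the card, but under them the count matches the listed figure exactly.

```python
# Architecture values copied from the Model Details table.
VOCAB, D_MODEL, N_LAYERS = 16_384, 384, 10
FFN_DIM, N_Q_HEADS, N_KV_HEADS = 1_024, 6, 2


def count_parameters() -> int:
    head_dim = D_MODEL // N_Q_HEADS        # 64 (assumed: hidden dim / query heads)
    kv_dim = N_KV_HEADS * head_dim         # 128: each KV head serves 3 query heads
    attn = (
        D_MODEL * D_MODEL                  # query projection
        + 2 * D_MODEL * kv_dim             # key and value projections (GQA)
        + D_MODEL * D_MODEL                # output projection
    )
    ffn = 3 * D_MODEL * FFN_DIM            # SwiGLU: gate, up, and down matrices
    norms = 2 * D_MODEL                    # assumed: two RMSNorm gains per block
    embeddings = VOCAB * D_MODEL           # counted once, input/output are tied
    final_norm = D_MODEL                   # assumed: one RMSNorm before the head
    return N_LAYERS * (attn + ffn + norms) + embeddings + final_norm


print(f"{count_parameters():,}")  # 22,028,160
```

Roughly 6.3M of the 22M parameters sit in the tied embedding matrix, which is why the small 16,384-entry vocabulary matters at this scale.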
|
|
| --- |
|
|
| ## Benchmarks |
|
|
| All benchmarks were run 0-shot with the [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). WikiText-2 is reported as byte_perplexity so that the score can be compared across models with different tokenizers. |
| |
| | Benchmark | Score | Notes | |
| |---|---|---| |
| | WikiText-2 (byte_perplexity) ↓ | **2.96** | Lower is better | |
| | BLiMP ↑ | **61.40%** | Average over 67 subtasks; random baseline 50% | |
| | ARC-Easy ↑ | **30.85%** | acc_norm, 0-shot | |
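
Byte perplexity normalizes the model's total negative log-likelihood by the number of UTF-8 bytes in the text rather than by the number of tokens, so the vocabulary size does not affect it. A minimal sketch of that conversion follows; the function names are illustrative and are not the harness's own API.

```python
import math


def byte_perplexity(total_nll_nats: float, total_bytes: int) -> float:
    """exp of the average negative log-likelihood per UTF-8 byte."""
    return math.exp(total_nll_nats / total_bytes)


def bits_per_byte(byte_ppl: float) -> float:
    """The same quantity expressed in bits, a common compression-style metric."""
    return math.log2(byte_ppl)


# The reported WikiText-2 score of 2.96 corresponds to about 1.57 bits per byte.
print(round(bits_per_byte(2.96), 3))
```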
| |
| --- |
| |
| ## Training |
| |
| ### Data Mix (~1.57B tokens, roughly 71 tokens per parameter) |
| |
| Data is ordered in ascending quality for curriculum learning — the model sees noisier web text first and the densest material last. |
| |
| | Source | Tokens | Share | |
| |---|---|---| |
| | epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% | |
| | bigcode/python-stack-v1-functions-filtered | ~160M | 10% | |
| | HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% | |
| | HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% | |
| | wikimedia/wikipedia (EN, 20231101) | ~80M | 5% | |
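
As a quick sanity check, the shares and the tokens-per-parameter ratio can be recomputed from the rounded token counts in the table. The short source labels below are illustrative. The rounded rows sum to about 1.58B, slightly above the ~1.57B in the heading, and the resulting ratio of about 71 to 72 tokens per parameter is well above the commonly quoted rule of thumb of about 20.

```python
# Approximate token counts (millions) copied from the Data Mix table.
MIX_M_TOKENS = {
    "FineWeb-HQ": 710,
    "python-stack functions": 160,
    "finemath-4plus": 235,
    "cosmopedia": 395,
    "wikipedia-en": 80,
}
N_PARAMS = 22_028_160  # from the Model Details table

total_m = sum(MIX_M_TOKENS.values())
shares = {name: round(100 * m / total_m, 1) for name, m in MIX_M_TOKENS.items()}
tokens_per_param = total_m * 1e6 / N_PARAMS

print(total_m)                     # 1580
print(shares)
print(round(tokens_per_param, 1))  # 71.7
```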
| |
| ### Hyperparameters |
| |
| | Setting | Value | |
| |---|---| |
| | Optimizer | Muon (body weights) + AdamW (embeddings, norms) | |
| | Muon lr | 0.02 | |
| | AdamW lr | 3e-4 | |
| | LR schedule | Warmup-Stable-Decay (WSD) | |
| | Warmup steps | 100 | |
| | Decay fraction | 20% of training | |
| | Weight decay | 0.1 | |
| | Gradient clipping | 1.0 | |
| | Effective batch | ~1.05M tokens/step | |
| | Total steps | 1,447 | |
| | Precision | bfloat16 | |
| | Attention | Flash Attention 2 (HF Kernels) | |
| | Final weights | EMA (β=0.999) of training trajectory | |
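
A Warmup-Stable-Decay schedule with the listed settings can be sketched as below. The linear warmup, the linear decay shape, and the decay to zero are assumptions, since the card only gives the warmup length and the decay fraction.

```python
def wsd_lr(step: int, peak_lr: float, total_steps: int = 1447,
           warmup_steps: int = 100, decay_frac: float = 0.2) -> float:
    """Learning rate at a 0-indexed optimizer step under warmup-stable-decay."""
    decay_steps = int(total_steps * decay_frac)      # 289 with the values above
    decay_start = total_steps - decay_steps          # 1158
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear warmup (assumed)
    if step < decay_start:
        return peak_lr                               # stable plateau
    remaining = total_steps - step                   # steps left, including this one
    return peak_lr * remaining / decay_steps         # linear decay (assumed shape)


# The same shape would be applied to both optimizers with different peaks.
for name, peak in (("muon", 0.02), ("adamw", 3e-4)):
    print(name, [wsd_lr(s, peak) for s in (0, 99, 600, 1158, 1446)])
```

With only 1,447 steps, the 100 warmup steps cover about 7% of training and the decay phase covers the last 289 steps.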
| |
| ### Hardware |
| |
| Trained on a single NVIDIA RTX PRO 6000 Blackwell (96GB) in approximately **20 minutes**. |
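
The figures above imply a rough training throughput, worked out below. It treats "~1.05M tokens/step" and "approximately 20 minutes" as exact, so the result is an order-of-magnitude estimate rather than a measured number.

```python
TOKENS_PER_STEP = 1.05e6   # "~1.05M tokens/step" from the Hyperparameters table
TOTAL_STEPS = 1_447
WALL_CLOCK_S = 20 * 60     # "approximately 20 minutes"

tokens_seen = TOKENS_PER_STEP * TOTAL_STEPS
throughput = tokens_seen / WALL_CLOCK_S

print(f"{tokens_seen / 1e9:.2f}B tokens seen")   # 1.52B tokens seen
print(f"{throughput / 1e6:.2f}M tokens/s")       # 1.27M tokens/s
```

The ~1.52B tokens seen is slightly below the ~1.57B-token corpus; the card does not say whether the difference is rounding or a held-out split.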
| |
| --- |
| |
| ## Tokenizer |
| |
| Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization. |
| |
| Special tokens: `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>`, `<|user|>`, `<|assistant|>`, `<|system|>` |
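
A tokenizer with this shape could be trained with the Hugging Face `tokenizers` library roughly as follows. This is a sketch, not the script used for this model: the two-sentence corpus, the small demo vocabulary, and the `train_bpe` helper are placeholders, and the order of the special tokens is assumed to follow the list above.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

SPECIAL_TOKENS = ["<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
                  "<|user|>", "<|assistant|>", "<|system|>"]


def train_bpe(texts, vocab_size: int = 16_384) -> Tokenizer:
    """Train a byte-level BPE tokenizer on an iterable of strings."""
    tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,  # special tokens receive the first ids
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # all 256 byte symbols
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Toy corpus for illustration only; a real run would stream the pretraining sample.
tok = train_bpe(["hello world", "hello there"], vocab_size=300)
print(tok.decode(tok.encode("hello world").ids))
```

Seeding the trainer with the full byte-level alphabet means any input can be encoded, so `<|unk|>` should rarely, if ever, be produced.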
| |
| --- |
| |
| ## Usage |
| |
| ```python |
| import torch |
| from tokenizers import Tokenizer |
| |
| # The model uses custom code (not a standard HF AutoModel); see model.py |
| from model import IvmeConfig, IvmeConversate |
| |
| tokenizer = Tokenizer.from_file("ivme_tokenizer.json") |
| # weights_only=False allows unpickling arbitrary objects (the checkpoint stores a |
| # config object), so only load checkpoints from a source you trust. |
| ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False) |
| cfg = ckpt["cfg"] |
| cfg.attn_backend = "sdpa" # or "kernels" for HF Kernels flash-attn |
| model = IvmeConversate(cfg).cuda() |
| model.load_state_dict(ckpt["model"]) |
| model.eval() |
| |
| # Encode the prompt as a batch of one sequence and sample a continuation. |
| prompt = "The theory of relativity states that" |
| ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda") |
| with torch.inference_mode():  # gradients are not needed for generation |
|     out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40) |
| print(tokenizer.decode(out[0].tolist())) |
| ``` |
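
The `temperature` and `top_k` arguments in the call above control how the next token is drawn. The internals of `model.generate` live in `model.py` and are not shown here, so the following is only a sketch of the standard formulation in plain Python; `sample_top_k` is a hypothetical helper.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from raw logits using temperature and top-k filtering."""
    # Keep only the k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# With top_k=1 the sampler reduces to greedy decoding and returns the argmax.
print(sample_top_k([0.1, 5.0, -2.0], top_k=1))  # 1
```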
| |
| --- |
| |
| ## Limitations |
| |
| - Base model only — not instruction tuned, will not follow instructions or answer questions |
| - English only (v1) |
| - Limited factual knowledge: a 22M-parameter model trained on ~1.57B tokens has little capacity for memorized facts |
| - Repetition at higher temperatures without `repetition_penalty` |
| - 1024 token context window |
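
Two of the limitations above are usually handled at inference time. The sketch below shows the common multiplicative repetition penalty and a simple way to keep the prompt within the 1024-token window. Both helpers are hypothetical, and whether `model.generate` accepts a `repetition_penalty` argument is not shown in the usage example.

```python
MAX_CONTEXT = 1024  # context length from the Model Details table


def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated token ids made less likely."""
    out = list(logits)
    for tok_id in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tok_id] = out[tok_id] / penalty if out[tok_id] > 0 else out[tok_id] * penalty
    return out


def clip_context(ids, max_len=MAX_CONTEXT):
    """Keep only the most recent max_len token ids."""
    return ids[-max_len:]


print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```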
|
|
| --- |
|
|
| ## What's Next |
|
|
| - **İvme-Conversate-22M-Instruct** — SFT on smol-smoltalk for instruction following |
| - **İvme-Conversate-v2** — extended training (~15B tokens), reordered curriculum |
| - **Turkish support** — v2 will add EN+TR with a dedicated bilingual tokenizer |
| - **İvme-Classify** — encoder-only series for classification tasks |
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{ivme-conversate-22m, |
| author = {IvmeLabs}, |
| title = {İvme-Conversate-22M-Base}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base} |
| } |
| ``` |
|
|
| --- |
|
|
| *Built by IvmeLabs. Small models, deliberate choices.* |