 ---
 license: apache-2.0
+language:
+- en
+tags:
+- language-model
+- transformer
+- rope
+- swiglu
+- gqa
+- muon
+- from-scratch
+- tiny
+- small
+- decoder-only
+datasets:
+- epfml/FineWeb-HQ
+- HuggingFaceTB/cosmopedia
+- HuggingFaceTB/finemath
+- bigcode/python-stack-v1-functions-filtered
+- wikimedia/wikipedia
+pipeline_tag: text-generation
 ---
+# İvme-Conversate-22M-Base
+**İvme** (Turkish: *acceleration*) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.
+The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision — architecture, data mix, optimizer, training schedule — is made deliberately.
+---
+## Model Details
+| Parameter | Value |
+|---|---|
+| Architecture | Decoder-only transformer |
+| Parameters | 22,028,160 |
+| Layers | 10 |
+| Hidden dim | 384 |
+| FFN dim | 1024 (SwiGLU) |
+| Attention heads | 6 query / 2 KV (GQA) |
+| Context length | 1024 tokens |
+| Vocab size | 16,384 (custom BPE) |
+| Positional encoding | RoPE (θ=10,000) |
+| Normalization | RMSNorm (pre-norm) |
+| Embeddings | Tied input/output |
+| Biases | None |
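
As a sanity check, the parameter total can be re-derived from the table above. The sketch below assumes a head dimension of 64 (384 / 6), three SwiGLU matrices per block, and two RMSNorm weights per block plus one final norm. Those details are not spelled out in the card, but under them the count reproduces the stated 22,028,160.

```python
# Shapes taken from the Model Details table.
VOCAB, HIDDEN, FFN, LAYERS = 16_384, 384, 1_024, 10
Q_HEADS, KV_HEADS = 6, 2
HEAD_DIM = HIDDEN // Q_HEADS  # assumption: 384 / 6 = 64 per head


def count_parameters() -> int:
    """Count weights for a bias-free, tied-embedding decoder with GQA and SwiGLU."""
    embedding = VOCAB * HIDDEN                   # tied input/output, counted once
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM         # query projection
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM   # key + value projections (GQA)
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN         # attention output projection
    swiglu = 3 * HIDDEN * FFN                    # assumption: gate, up, down matrices
    norms = 2 * HIDDEN                           # assumption: two RMSNorm weights per block
    per_layer = q_proj + kv_proj + o_proj + swiglu + norms
    final_norm = HIDDEN                          # assumption: one RMSNorm before the LM head
    return embedding + LAYERS * per_layer + final_norm


print(f"{count_parameters():,}")  # 22,028,160
```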
+---
+## Benchmarks
+All benchmarks were run 0-shot with the [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). WikiText-2 is reported as `byte_perplexity`, which normalizes by bytes rather than tokens, so the score can be compared across tokenizers.
+| Benchmark | Score | Notes |
+|---|---|---|
+| WikiText-2 (byte_perplexity) ↓ | **2.96** | Lower is better |
+| BLiMP ↑ | **61.40%** | Average over 67 subtasks; random baseline 50% |
+| ARC-Easy ↑ | **30.85%** | acc_norm, 0-shot; random baseline ≈25% |
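
For readers more used to bits per byte, the WikiText-2 number can be converted directly. This is a minimal sketch assuming byte perplexity is defined as the exponential of the negative log-likelihood (in nats) per byte. The function names are illustrative and are not harness APIs.

```python
import math


def byte_perplexity(total_log_likelihood: float, total_bytes: int) -> float:
    """exp of the average negative log-likelihood (nats) per UTF-8 byte."""
    return math.exp(-total_log_likelihood / total_bytes)


def bits_per_byte(byte_ppl: float) -> float:
    """Same quantity expressed in bits: log2 of the byte perplexity."""
    return math.log2(byte_ppl)


# The reported 2.96 corresponds to roughly 1.57 bits per byte.
print(round(bits_per_byte(2.96), 2))
```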
+---
+## Training
+### Data Mix (~1.57B tokens, Chinchilla-optimal)
+Data is ordered in ascending quality for curriculum learning — the model sees noisier web text first and the densest material last.
+| Source | Tokens | Share |
+|---|---|---|
+| epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
+| bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
+| HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
+| HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
+| wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |
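
The shares in the table follow from the approximate token counts. The sketch below recomputes them and derives cumulative stage boundaries for a sequential curriculum. It assumes the table rows are listed in curriculum order (first row seen first), which the card implies but does not state outright. `DATA_MIX`, `shares`, and `stage_ends` are illustrative names.

```python
# Approximate token counts in millions, copied from the table.
# Assumption: rows appear in the order the model sees them.
DATA_MIX = [
    ("epfml/FineWeb-HQ", 710),
    ("bigcode/python-stack-v1-functions-filtered", 160),
    ("HuggingFaceTB/finemath", 235),
    ("HuggingFaceTB/cosmopedia", 395),
    ("wikimedia/wikipedia", 80),
]


def shares(mix):
    """Fraction of the total token budget contributed by each source."""
    total = sum(tokens for _, tokens in mix)
    return {name: tokens / total for name, tokens in mix}


def stage_ends(mix):
    """Cumulative token offset (millions) at which each curriculum stage ends."""
    ends, running = [], 0
    for name, tokens in mix:
        running += tokens
        ends.append((name, running))
    return ends


for name, frac in shares(DATA_MIX).items():
    print(f"{name:45s} {frac:6.1%}")
print(stage_ends(DATA_MIX))
```

The rounded counts sum to about 1.58B tokens, which is consistent with the ~1.57B headline figure given that every row is approximate.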
+### Hyperparameters
+| Setting | Value |
+|---|---|
+| Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
+| Muon lr | 0.02 |
+| AdamW lr | 3e-4 |
+| LR schedule | Warmup-Stable-Decay (WSD) |
+| Warmup steps | 100 |
+| Decay fraction | 20% of training |
+| Weight decay | 0.1 |
+| Gradient clipping | 1.0 |
+| Effective batch | ~1.05M tokens/step |
+| Total steps | 1,447 |
+| Precision | bfloat16 |
+| Attention | Flash Attention 2 (HF Kernels) |
+| Final weights | EMA (β=0.999) of training trajectory |
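
The Warmup-Stable-Decay schedule from the table can be sketched as a plain function of the step index. The linear warmup, the linear decay to zero, and the function name `wsd_lr` are assumptions; the card gives only the warmup length, the decay fraction, and the total step count.

```python
def wsd_lr(step: int, peak_lr: float, total_steps: int = 1447,
           warmup_steps: int = 100, decay_frac: float = 0.2,
           min_lr: float = 0.0) -> float:
    """Learning rate at `step` under a Warmup-Stable-Decay schedule."""
    decay_steps = int(total_steps * decay_frac)   # 289 steps for the values above
    decay_start = total_steps - decay_steps       # step 1158
    if step < warmup_steps:
        # Assumption: linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Assumption: linear decay from peak_lr to min_lr over the final 20%.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


# The same shape applies to both optimizers, scaled by their own peak rates.
for s in (0, 99, 600, 1158, 1300, 1447):
    print(s, wsd_lr(s, peak_lr=0.02), wsd_lr(s, peak_lr=3e-4))
```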
+### Hardware
+Trained on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) in approximately **20 minutes**.
+---
+## Tokenizer
+Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.
+Special tokens: `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>`, `<|user|>`, `<|assistant|>`, `<|system|>`
+---
+## Usage
+```python
+import torch
+from tokenizers import Tokenizer
+
+# Custom code, not a standard HF AutoModel (see model.py in this repo)
+from model import IvmeConfig, IvmeConversate
+
+device = "cuda"  # the card only shows CUDA usage
+tokenizer = Tokenizer.from_file("ivme_tokenizer.json")
+
+# weights_only=False unpickles arbitrary objects: only load checkpoints you trust
+ckpt = torch.load("ivme_base_ema.pt", map_location=device, weights_only=False)
+cfg = ckpt["cfg"]
+cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn
+model = IvmeConversate(cfg).to(device)
+model.load_state_dict(ckpt["model"])
+model.eval()
+
+prompt = "The theory of relativity states that"
+ids = torch.tensor([tokenizer.encode(prompt).ids], device=device)
+with torch.no_grad():  # inference only, no gradients needed
+    out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
+print(tokenizer.decode(out[0].tolist()))
+```
+---
+## Limitations
+- Base model only: not instruction-tuned, so it will not reliably follow instructions or answer questions
+- English only (v1)
+- Limited factual knowledge due to Chinchilla-optimal training (1.57B tokens)
+- Output tends to repeat at higher temperatures unless a `repetition_penalty` is applied
+- 1024 token context window
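
The card does not show whether `model.generate` accepts a `repetition_penalty` argument, so the following is only a sketch of the usual logit-level penalty that could be applied before sampling. It works on plain Python lists to stay self-contained; `apply_repetition_penalty` is a hypothetical helper, not part of this repo.

```python
def apply_repetition_penalty(logits: list[float], generated_ids: list[int],
                             penalty: float = 1.2) -> list[float]:
    """Make already-generated tokens less likely before sampling the next one.

    Positive logits are divided by `penalty` and negative logits are
    multiplied by it, so the score always moves toward "less likely".
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Tokens 0 and 1 were already generated; token 2 is left untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```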
+---
+## What's Next
+- **İvme-Conversate-22M-Instruct** — SFT on smol-smoltalk for instruction following
+- **İvme-Conversate-v2** — extended training (~15B tokens), reordered curriculum
+- **Turkish support** — v2 will add EN+TR with a dedicated bilingual tokenizer
+- **İvme-Classify** — encoder-only series for classification tasks
+---
+## Citation
+```bibtex
+@misc{ivme-conversate-22m,
+  author       = {IvmeLabs},
+  title        = {İvme-Conversate-22M-Base},
+  year         = {2026},
+  publisher    = {Hugging Face},
+  url          = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
+}
+```
+---
+*Built by IvmeLabs. Small models, deliberate choices.*