---
license: apache-2.0
language:
- en
tags:
- language-model
- transformer
- rope
- swiglu
- gqa
- muon
- from-scratch
- tiny
- small
- decoder-only
datasets:
- epfml/FineWeb-HQ
- HuggingFaceTB/cosmopedia
- HuggingFaceTB/finemath
- bigcode/python-stack-v1-functions-filtered
- wikimedia/wikipedia
pipeline_tag: text-generation
---
# İvme-Conversate-22M-Base

**İvme** (Turkish: *acceleration*) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.

The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision (architecture, data mix, optimizer, training schedule) is made deliberately.

---
## Model Details
| Parameter | Value |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 22,028,160 |
| Layers | 10 |
| Hidden dim | 384 |
| FFN dim | 1024 (SwiGLU) |
| Attention heads | 6 query / 2 KV (GQA) |
| Context length | 1024 tokens |
| Vocab size | 16,384 (custom BPE) |
| Positional encoding | RoPE (θ=10,000) |
| Normalization | RMSNorm (pre-norm) |
| Embeddings | Tied input/output |
| Biases | None |
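
The parameter count can be reproduced from the table. The sketch below assumes a head dimension of 64 (hidden dim divided by query heads, not stated explicitly here), two RMSNorm weight vectors per layer plus one final norm, and three SwiGLU projection matrices per FFN.

```python
def count_parameters() -> int:
    # Values taken from the Model Details table
    vocab_size, d_model, d_ffn, n_layers = 16_384, 384, 1024, 10
    n_q_heads, n_kv_heads = 6, 2
    head_dim = d_model // n_q_heads  # assumption: 64

    embedding = vocab_size * d_model  # counted once, input/output are tied
    # Q and output projections use all query heads; K and V use the 2 KV heads
    attn = 2 * d_model * (n_q_heads * head_dim) + 2 * d_model * (n_kv_heads * head_dim)
    ffn = 3 * d_model * d_ffn         # SwiGLU: gate, up, and down projections
    norms = 2 * d_model               # assumption: attention norm + FFN norm
    per_layer = attn + ffn + norms    # no biases anywhere

    final_norm = d_model
    return embedding + n_layers * per_layer + final_norm


print(f"{count_parameters():,}")  # 22,028,160
```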
---
## Benchmarks
All benchmarks were run 0-shot with the [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). WikiText-2 is reported as `byte_perplexity`, which normalizes by bytes rather than tokens so the score is comparable across tokenizers.

| Benchmark | Score | Notes |
|---|---|---|
| WikiText-2 (byte_perplexity) ↓ | **2.96** | Lower is better |
| BLiMP ↑ | **61.40%** | Average over 67 subtasks; random baseline 50% |
| ARC-Easy ↑ | **30.85%** | acc_norm, 0-shot; random baseline ~25% |
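
For readers unfamiliar with the metric, this is a minimal sketch of how byte-level perplexity relates to a summed log-likelihood. The function names are illustrative and are not the harness API; the definition (exponentiated negative log-likelihood per UTF-8 byte) is the usual one and is assumed to match what the harness reports. Under that definition, a byte perplexity of 2.96 corresponds to about 1.57 bits per byte.

```python
import math


def byte_perplexity(total_log_likelihood: float, num_bytes: int) -> float:
    """exp of the average negative log-likelihood (in nats) per byte."""
    return math.exp(-total_log_likelihood / num_bytes)


def bits_per_byte(total_log_likelihood: float, num_bytes: int) -> float:
    """The same quantity in bits; equals log2(byte_perplexity)."""
    return -total_log_likelihood / (num_bytes * math.log(2))


# Convert the reported score to bits per byte
print(round(math.log2(2.96), 2))
```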
---
## Training
### Data Mix (~1.57B tokens, ~71 tokens per parameter)
Data is ordered in ascending quality for curriculum learning — the model sees noisier web text first and the densest material last.
| Source | Tokens | Share |
|---|---|---|
| epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
| bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
| HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
| HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
| wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |
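
The budget arithmetic can be checked in a few lines. This sketch uses only figures from this card (token total, shares, parameter count, step count, and batch size). The roughly 20 tokens per parameter often quoted as the Chinchilla rule of thumb is an outside reference point, not a number from this card.

```python
PARAMS = 22_028_160
TOTAL_TOKENS = 1.57e9

# Shares from the Data Mix table, in curriculum order
shares = {
    "epfml/FineWeb-HQ": 0.45,
    "bigcode/python-stack-v1-functions-filtered": 0.10,
    "HuggingFaceTB/finemath": 0.15,
    "HuggingFaceTB/cosmopedia": 0.25,
    "wikimedia/wikipedia": 0.05,
}

tokens_per_source = {name: share * TOTAL_TOKENS for name, share in shares.items()}
tokens_per_param = TOTAL_TOKENS / PARAMS

# Tokens actually consumed according to the hyperparameter table
tokens_seen = 1_447 * 1.05e6

for name, tokens in tokens_per_source.items():
    print(f"{name}: ~{tokens / 1e6:.0f}M tokens")
print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"steps x batch: {tokens_seen / 1e9:.2f}B tokens")
```

Steps times batch size comes out slightly below the 1.57B total, which is consistent with both figures being approximate.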
### Hyperparameters
| Setting | Value |
|---|---|
| Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
| Muon lr | 0.02 |
| AdamW lr | 3e-4 |
| LR schedule | Warmup-Stable-Decay (WSD) |
| Warmup steps | 100 |
| Decay fraction | 20% of training |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Effective batch | ~1.05M tokens/step |
| Total steps | 1,447 |
| Precision | bfloat16 |
| Attention | Flash Attention 2 (HF Kernels) |
| Final weights | EMA (β=0.999) of training trajectory |
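
The Warmup-Stable-Decay schedule in the table can be sketched as a plain function of the step index. The step counts come from the table; the linear shape of both the warmup and the decay, and decaying all the way to zero, are assumptions, since the card does not specify them.

```python
def wsd_lr(
    step: int,
    peak_lr: float,
    total_steps: int = 1447,
    warmup_steps: int = 100,
    decay_fraction: float = 0.2,
) -> float:
    """Learning rate at a 0-indexed step under a Warmup-Stable-Decay schedule."""
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup (assumed), reaching peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate
        return peak_lr
    # Linear decay (assumed) over the final 20% of training
    return peak_lr * (total_steps - step) / decay_steps


# The same shape applies to both optimizers, each with its own peak
for step in (0, 99, 1000, 1300, 1446):
    print(step, wsd_lr(step, peak_lr=0.02), wsd_lr(step, peak_lr=3e-4))
```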
### Hardware
Trained on a single NVIDIA RTX PRO 6000 Blackwell (96GB) in approximately **20 minutes**.

---
## Tokenizer
Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.

Special tokens: `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>`, `<|user|>`, `<|assistant|>`, `<|system|>`

---
## Usage
```python
import torch
from tokenizers import Tokenizer

# Custom code, not a standard HF AutoModel (see model.py).
# IvmeConfig is imported so the config stored in the checkpoint can be unpickled.
from model import IvmeConfig, IvmeConversate

tokenizer = Tokenizer.from_file("ivme_tokenizer.json")

# weights_only=False unpickles arbitrary Python objects (here, the config),
# so only load checkpoints you trust.
ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False)
cfg = ckpt["cfg"]
cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn

model = IvmeConversate(cfg).cuda()
model.load_state_dict(ckpt["model"])
model.eval()

prompt = "The theory of relativity states that"
ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")

# Inference only, so skip gradient tracking
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0].tolist()))
```
---
## Limitations
- Base model only — not instruction tuned, will not follow instructions or answer questions
- English only (v1)
- Limited factual knowledge, as expected from 22M parameters trained on 1.57B tokens
- Repetition at higher temperatures without `repetition_penalty`
- 1024 token context window
---
## What's Next
- **İvme-Conversate-22M-Instruct** — SFT on smol-smoltalk for instruction following
- **İvme-Conversate-v2** — extended training (~15B tokens), reordered curriculum
- **Turkish support** — v2 will add EN+TR with a dedicated bilingual tokenizer
- **İvme-Classify** — encoder-only series for classification tasks
---
## Citation
```bibtex
@misc{ivme-conversate-22m,
author = {IvmeLabs},
title = {İvme-Conversate-22M-Base},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
}
```
---
*Built by IvmeLabs. Small models, deliberate choices.*