---
license: apache-2.0
language:
- en
tags:
- language-model
- transformer
- rope
- swiglu
- gqa
- muon
- from-scratch
- tiny
- small
- decoder-only
datasets:
- epfml/FineWeb-HQ
- HuggingFaceTB/cosmopedia
- HuggingFaceTB/finemath
- bigcode/python-stack-v1-functions-filtered
- wikimedia/wikipedia
pipeline_tag: text-generation
---

# İvme-Conversate-22M-Base

![Conversate-22M Logo](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/Gi8oMz-Q8n2CImbtVyHOy.png)

**İvme** (Turkish: *acceleration*) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.

The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision — architecture, data mix, optimizer, training schedule — is made deliberately.

---

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 22,028,160 |
| Layers | 10 |
| Hidden dim | 384 |
| FFN dim | 1024 (SwiGLU) |
| Attention heads | 6 query / 2 KV (GQA) |
| Context length | 1024 tokens |
| Vocab size | 16,384 (custom BPE) |
| Positional encoding | RoPE (θ=10,000) |
| Normalization | RMSNorm (pre-norm) |
| Embeddings | Tied input/output |
| Biases | None |
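
The parameter count in the table can be reproduced from the other rows. The sketch below is not the repository's `model.py`; it assumes a head dimension of 384 / 6 = 64, two RMSNorm gains per block plus one final norm, and no extra learned parameters. The class name `TinyConfig` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TinyConfig:
    # Values copied from the Model Details table; the class name is hypothetical.
    vocab_size: int = 16_384
    dim: int = 384
    ffn_dim: int = 1024
    n_layers: int = 10
    n_q_heads: int = 6
    n_kv_heads: int = 2


def count_parameters(cfg: TinyConfig) -> int:
    head_dim = cfg.dim // cfg.n_q_heads        # 384 / 6 = 64 (assumed)
    kv_dim = cfg.n_kv_heads * head_dim         # 2 * 64 = 128, GQA shrinks K and V
    attn = 2 * cfg.dim * cfg.dim + 2 * cfg.dim * kv_dim  # Q and O, then K and V
    ffn = 3 * cfg.dim * cfg.ffn_dim            # SwiGLU: gate, up, down projections
    norms = 2 * cfg.dim                        # two RMSNorm gains per block (assumed)
    per_layer = attn + ffn + norms
    embeddings = cfg.vocab_size * cfg.dim      # counted once because they are tied
    final_norm = cfg.dim                       # one RMSNorm before the output head
    return cfg.n_layers * per_layer + embeddings + final_norm


print(f"{count_parameters(TinyConfig()):,}")  # 22,028,160
```

Under these assumptions the embeddings account for about 6.3M of the 22M parameters, which is why tying input and output embeddings matters at this scale.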

---

## Benchmarks

All benchmarks were run with [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) in the 0-shot setting. WikiText-2 is reported as `byte_perplexity`, which normalizes by bytes rather than tokens and so stays comparable across tokenizers.

| Benchmark | Score | Notes |
|---|---|---|
| WikiText-2 (byte_perplexity) ↓ | **2.96** | Lower is better |
| BLiMP ↑ | **61.40%** | Average over 67 subtasks; random baseline 50% |
| ARC-Easy ↑ | **30.85%** | `acc_norm`; random baseline ~25% |
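
For readers unfamiliar with the metric, the following is a minimal sketch of how byte perplexity relates to the summed log-likelihood and to bits per byte. It assumes the log-likelihood is in nats and is summed over the whole evaluation text; the function names are illustrative and are not harness APIs.

```python
import math


def byte_perplexity(total_log_likelihood: float, total_bytes: int) -> float:
    """exp of the negative log-likelihood (nats) averaged over UTF-8 bytes."""
    return math.exp(-total_log_likelihood / total_bytes)


def bits_per_byte(byte_ppl: float) -> float:
    """Same quantity expressed in bits instead of as a perplexity."""
    return math.log2(byte_ppl)


# The reported WikiText-2 score of 2.96 corresponds to:
print(round(bits_per_byte(2.96), 2))  # 1.57
```

Because the denominator is a byte count, two models with different vocabularies are scored on the same unit of text.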

---

## Training

### Data Mix (~1.57B tokens)

Data is ordered by ascending quality as a curriculum: the model sees noisier web text first and the densest material last. At roughly 70 tokens per parameter, the budget is well above the ~20 tokens per parameter usually quoted as Chinchilla-optimal.

| Source | Tokens | Share |
|---|---|---|
| epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
| bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
| HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
| HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
| wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |
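
As a quick consistency check on the table, the sketch below recomputes each share from the approximate token counts and derives the tokens-per-parameter ratio. The counts are the rounded figures from the table, so the total comes out slightly above the stated ~1.57B.

```python
# Token counts in millions, copied from the table; all are approximate.
mix_millions = {
    "epfml/FineWeb-HQ": 710,
    "bigcode/python-stack-v1-functions-filtered": 160,
    "HuggingFaceTB/finemath": 235,
    "HuggingFaceTB/cosmopedia": 395,
    "wikimedia/wikipedia": 80,
}
N_PARAMS = 22_028_160  # from the Model Details table

total_millions = sum(mix_millions.values())
shares = {name: round(100 * n / total_millions) for name, n in mix_millions.items()}
tokens_per_param = total_millions * 1e6 / N_PARAMS

for name, pct in shares.items():
    print(f"{name:45s} {pct:3d}%")
print(f"total: ~{total_millions / 1000:.2f}B tokens, "
      f"~{tokens_per_param:.0f} tokens per parameter")
```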

### Hyperparameters

| Setting | Value |
|---|---|
| Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
| Muon lr | 0.02 |
| AdamW lr | 3e-4 |
| LR schedule | Warmup-Stable-Decay (WSD) |
| Warmup steps | 100 |
| Decay fraction | 20% of training |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Effective batch | ~1.05M tokens/step |
| Total steps | 1,447 |
| Precision | bfloat16 |
| Attention | Flash Attention 2 (HF Kernels) |
| Final weights | EMA (β=0.999) of training trajectory |
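
The Warmup-Stable-Decay schedule in the table can be sketched as a multiplier on the base learning rate. The table gives the warmup length and the decay fraction but not the curve shapes, so the linear warmup and linear decay to zero below are assumptions, and `wsd_lr_scale` is a hypothetical name.

```python
def wsd_lr_scale(step: int, total_steps: int = 1447,
                 warmup_steps: int = 100, decay_frac: float = 0.20) -> float:
    """Multiplier in [0, 1] applied to the base learning rate at `step`."""
    decay_steps = int(round(decay_frac * total_steps))  # 289 of 1,447 steps
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear warmup (assumed shape)
    if step < decay_start:
        return 1.0                                # stable phase at full rate
    return (total_steps - step) / decay_steps     # linear decay (assumed shape)


# Muon learning rate at a few points in training, using the 0.02 base rate.
for step in (0, 99, 100, 1157, 1158, 1446):
    print(step, round(0.02 * wsd_lr_scale(step), 5))
```

The same multiplier would apply to the AdamW rate of 3e-4 if both optimizers share one schedule, which the table does not state.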

### Hardware

Trained on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) in approximately **20 minutes**.

---

## Tokenizer

Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.

Special tokens: `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>`, `<|user|>`, `<|assistant|>`, `<|system|>`
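
A tokenizer of this kind can be trained with the `tokenizers` library roughly as follows. This is a sketch, not the author's training script: the toy corpus, the small `vocab_size` used in the demo, and the assumption that special tokens take the first ids in the listed order are all illustrative.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Order copied from the list above; the resulting ids are an assumption.
SPECIAL_TOKENS = ["<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
                  "<|user|>", "<|assistant|>", "<|system|>"]


def train_bpe(texts, vocab_size=16_384):
    """Train a byte-level BPE tokenizer on an iterable of strings."""
    tok = Tokenizer(models.BPE(unk_token="<|unk|>"))
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tok.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        # Seed the vocabulary with all 256 byte symbols so no input is unknown.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tok.train_from_iterator(texts, trainer=trainer)
    return tok


# Toy corpus and a small vocabulary, just to show the round trip.
tok = train_bpe(["hello world", "hello there, world"] * 50, vocab_size=300)
ids = tok.encode("hello world").ids
print(ids, tok.decode(ids))
```

For the real tokenizer, `texts` would be an iterator over the balanced sample of the pretraining corpus and `vocab_size` would stay at 16,384.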

---

## Usage

```python
import torch
from tokenizers import Tokenizer

# Custom code, not a standard HF AutoModel: model.py must be importable.
from model import IvmeConfig, IvmeConversate

# The CPU fallback is an untested assumption; the original example targets CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = Tokenizer.from_file("ivme_tokenizer.json")
# weights_only=False unpickles the stored config object, so only load
# checkpoints you trust.
ckpt = torch.load("ivme_base_ema.pt", map_location=device, weights_only=False)
cfg = ckpt["cfg"]
cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn
model = IvmeConversate(cfg).to(device)
model.load_state_dict(ckpt["model"])
model.eval()

prompt = "The theory of relativity states that"
ids = torch.tensor([tokenizer.encode(prompt).ids], device=device)
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0].tolist()))
```
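
The `temperature` and `top_k` arguments in the call above usually mean the following. This is a dependency-free sketch of the common technique on a plain list of logits; whether `IvmeConversate.generate` implements it exactly this way is an assumption, and `sample_top_k` is a hypothetical helper.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=None):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Indices of the top_k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)                                 # subtract max for stability
    weights = [math.exp(s - peak) for s in scaled]     # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 5.0, -2.0, 4.5]
print(sample_top_k(logits, top_k=2, rng=random.Random(0)))  # id 1 or 3
```

Lower temperatures sharpen the distribution toward the highest logit, and `top_k=1` reduces to greedy decoding.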

---

## Limitations

- Base model only — not instruction tuned, will not follow instructions or answer questions
- English only (v1)
- Limited factual knowledge: a 22M-parameter model trained on ~1.57B tokens can store only so much
- Tends to repeat itself at higher temperatures unless a `repetition_penalty` is applied
- 1024 token context window
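
A repetition penalty is typically applied to the logits before sampling. The sketch below shows the common formulation popularized by the CTRL paper; the source does not confirm that `generate` accepts a `repetition_penalty` argument, so treat this as a standalone illustration with a hypothetical function name.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits are divided and negative ones multiplied, so the
        # adjustment always lowers the token's score when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```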

---

## What's Next

- **İvme-Conversate-22M-Instruct** — SFT on smol-smoltalk for instruction following
- **İvme-Conversate-v2** — extended training (~15B tokens), reordered curriculum
- **Turkish support** — v2 will add EN+TR with a dedicated bilingual tokenizer
- **İvme-Classify** — encoder-only series for classification tasks

---

## Citation

```bibtex
@misc{ivme-conversate-22m,
  author       = {IvmeLabs},
  title        = {İvme-Conversate-22M-Base},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
}
```

---

*Built by IvmeLabs. Small models, deliberate choices.*