ereniko committed · Commit 848dd53 · verified · 1 Parent(s): d93e05d

Update README.md

Files changed (1)
  1. README.md +165 -0
README.md CHANGED
@@ -1,3 +1,168 @@
  ---
  license: apache-2.0
+ language:
+ - en
+ tags:
+ - language-model
+ - transformer
+ - rope
+ - swiglu
+ - gqa
+ - muon
+ - from-scratch
+ - tiny
+ - small
+ - decoder-only
+ datasets:
+ - epfml/FineWeb-HQ
+ - HuggingFaceTB/cosmopedia
+ - HuggingFaceTB/finemath
+ - bigcode/python-stack-v1-functions-filtered
+ - wikimedia/wikipedia
+ pipeline_tag: text-generation
  ---
+
+ # İvme-Conversate-22M-Base
+
+ **İvme** (Turkish: *acceleration*) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.
+
+ The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision (architecture, data mix, optimizer, training schedule) is made deliberately.
+
+ ---
+
+ ## Model Details
+
+ | Parameter | Value |
+ |---|---|
+ | Architecture | Decoder-only transformer |
+ | Parameters | 22,028,160 |
+ | Layers | 10 |
+ | Hidden dim | 384 |
+ | FFN dim | 1024 (SwiGLU) |
+ | Attention heads | 6 query / 2 KV (GQA) |
+ | Context length | 1024 tokens |
+ | Vocab size | 16,384 (custom BPE) |
+ | Positional encoding | RoPE (θ=10,000) |
+ | Normalization | RMSNorm (pre-norm) |
+ | Embeddings | Tied input/output |
+ | Biases | None |
+
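The parameter total in the table can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope check, not code from the repository. It assumes a head dimension of 384 / 6 = 64, three SwiGLU projection matrices (gate, up, down), two RMSNorm weight vectors per layer plus one final norm, and a tied embedding matrix that is counted once. The function name `count_params` is hypothetical.

```python
def count_params(vocab=16_384, d_model=384, n_layers=10, d_ffn=1024,
                 n_q_heads=6, n_kv_heads=2):
    """Count weights for a bias-free, tied-embedding decoder (assumed layout)."""
    head_dim = d_model // n_q_heads                # 64
    kv_dim = n_kv_heads * head_dim                 # 128: K/V heads are shared across query groups
    embed = vocab * d_model                        # tied input/output, counted once
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, plus K and V
    ffn = 3 * d_model * d_ffn                      # gate, up, down projections
    norms = 2 * d_model                            # pre-attention and pre-FFN RMSNorm
    per_layer = attn + ffn + norms
    return embed + n_layers * per_layer + d_model  # + final RMSNorm


print(f"{count_params():,}")  # 22,028,160
```

Under these assumptions, about 6.3M of the 22M parameters sit in the embedding matrix and about 15.7M in the transformer body.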
+ ---
+
+ ## Benchmarks
+
+ All benchmarks were run with [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), 0-shot. WikiText-2 is reported as byte_perplexity so the score can be compared across tokenizers.
+
+ | Benchmark | Score | Notes |
+ |---|---|---|
+ | WikiText-2 (byte_perplexity) ↓ | **2.96** | Lower is better |
+ | BLiMP ↑ | **61.40%** | Average over 67 subtasks; random baseline 50% |
+ | ARC-Easy ↑ | **30.85%** | acc_norm, 0-shot |
+
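Byte-level perplexity normalizes the model's total log-likelihood by the number of UTF-8 bytes in the text rather than by the number of tokens, which is why it does not depend on the tokenizer. The following is a minimal sketch of that idea, assuming the usual definition (exponential of the average negative log-likelihood per byte, in nats). The helper names are hypothetical and this is not the harness's own code.

```python
import math


def byte_perplexity(total_log_likelihood: float, total_bytes: int) -> float:
    """exp of the average negative log-likelihood (nats) per UTF-8 byte."""
    return math.exp(-total_log_likelihood / total_bytes)


def bits_per_byte(byte_ppl: float) -> float:
    """The same quantity expressed in bits: log2 of the byte perplexity."""
    return math.log2(byte_ppl)


# The byte count comes from the raw text, not from the token sequence.
n_bytes = len("İvme means acceleration.".encode("utf-8"))

print(round(bits_per_byte(2.96), 2))  # 1.57
```

Read this way, the reported 2.96 corresponds to roughly 1.57 bits per byte.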
+ ---
+
+ ## Training
+
+ ### Data Mix (~1.57B tokens, roughly 70 tokens per parameter)
+
+ Data is ordered in ascending quality for curriculum learning: the model sees noisier web text first and the densest material last.
+
+ | Source | Tokens | Share |
+ |---|---|---|
+ | epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
+ | bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
+ | HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
+ | HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
+ | wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |
+
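The shares in the table follow directly from the approximate token counts. A small arithmetic check, using only the rounded figures above together with the parameter count from the Model Details table (the variable names are illustrative):

```python
# Approximate token counts from the table, in millions of tokens.
mix_m_tokens = {
    "epfml/FineWeb-HQ": 710,
    "bigcode/python-stack-v1-functions-filtered": 160,
    "HuggingFaceTB/finemath": 235,
    "HuggingFaceTB/cosmopedia": 395,
    "wikimedia/wikipedia": 80,
}

total = sum(mix_m_tokens.values())  # millions of tokens
shares = {name: round(100 * n / total) for name, n in mix_m_tokens.items()}
tokens_per_param = total * 1e6 / 22_028_160

for name, pct in shares.items():
    print(f"{pct:>3}%  {name}")
print(f"total ~{total / 1000:.2f}B tokens, ~{tokens_per_param:.0f} tokens per parameter")
```

Because the per-source counts are rounded, their sum (about 1.58B) differs slightly from the ~1.57B quoted in the heading.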
+ ### Hyperparameters
+
+ | Setting | Value |
+ |---|---|
+ | Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
+ | Muon lr | 0.02 |
+ | AdamW lr | 3e-4 |
+ | LR schedule | Warmup-Stable-Decay (WSD) |
+ | Warmup steps | 100 |
+ | Decay fraction | 20% of training |
+ | Weight decay | 0.1 |
+ | Gradient clipping | 1.0 |
+ | Effective batch | ~1.05M tokens/step |
+ | Total steps | 1,447 |
+ | Precision | bfloat16 |
+ | Attention | Flash Attention 2 (HF Kernels) |
+ | Final weights | EMA (β=0.999) of training trajectory |
+
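A Warmup-Stable-Decay schedule with the step counts from the table can be sketched as a multiplier on the peak learning rate. The table does not state the warmup or decay shape, so the linear ramps below are assumptions, as is applying the same multiplier to both optimizers. The function name `wsd_lr_scale` is hypothetical.

```python
def wsd_lr_scale(step, total_steps=1447, warmup_steps=100, decay_frac=0.20):
    """Multiplier in [0, 1] applied to the peak learning rate at a 0-indexed step."""
    decay_steps = int(total_steps * decay_frac)  # 289 steps of decay
    decay_start = total_steps - decay_steps      # decay begins at step 1158
    if step < warmup_steps:                      # linear warmup (assumed shape)
        return (step + 1) / warmup_steps
    if step < decay_start:                       # stable plateau at the peak rate
        return 1.0
    return max(0.0, (total_steps - step) / decay_steps)  # linear decay (assumed shape)


# Peak rates from the table, scaled per step.
for step in (0, 99, 600, 1158, 1446):
    scale = wsd_lr_scale(step)
    print(step, f"muon={0.02 * scale:.5f}", f"adamw={3e-4 * scale:.2e}")
```

With these numbers the learning rate sits at its peak from step 99 through step 1158, the first step of the decay phase, and falls only over the final 289 steps.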
+ ### Hardware
+
+ Trained on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) in approximately **20 minutes**.
+
+ ---
+
+ ## Tokenizer
+
+ Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.
+
+ Special tokens: `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>`, `<|user|>`, `<|assistant|>`, `<|system|>`
+
+ ---
+
+ ## Usage
+
+ ```python
+ import torch
+ from tokenizers import Tokenizer
+
+ # Load with custom code (not a standard HF AutoModel; see model.py)
+ from model import IvmeConfig, IvmeConversate
+
+ tokenizer = Tokenizer.from_file("ivme_tokenizer.json")
+ # weights_only=False allows arbitrary pickled objects (here the config),
+ # so only load checkpoints you trust
+ ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False)
+ cfg = ckpt["cfg"]
+ cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn
+ model = IvmeConversate(cfg).cuda()
+ model.load_state_dict(ckpt["model"])
+ model.eval()
+
+ # Encode the prompt as a batch of one sequence and sample a continuation
+ prompt = "The theory of relativity states that"
+ ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")
+ out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
+ print(tokenizer.decode(out[0].tolist()))
+ ```
+
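The `temperature` and `top_k` arguments in the call above control how each next token is drawn. How `model.generate` implements them is not shown here, so the following is only a generic, dependency-free sketch of top-k sampling with temperature; `sample_top_k` is a hypothetical helper, not part of `model.py`.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id from a list of raw logits."""
    # Keep only the top_k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # A temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max before exp for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# With top_k=2 the lowest-scoring token (id 2) can never be drawn.
print(sample_top_k([0.1, 2.0, -1.0], temperature=0.8, top_k=2))
```

Setting `top_k=1` reduces this to greedy decoding, since only the highest-scoring token remains.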
+ ---
+
+ ## Limitations
+
+ - Base model only: not instruction-tuned, so it will not reliably follow instructions or answer questions
+ - English only (v1)
+ - Limited factual knowledge: only 22M parameters and a ~1.57B-token training budget
+ - Repetition at higher temperatures without `repetition_penalty`
+ - 1024-token context window
+
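The `repetition_penalty` mentioned in the list is commonly implemented by lowering the logits of tokens that have already been generated. The sketch below follows that common formulation (the one used by the option of the same name in Hugging Face Transformers). Whether İvme's `generate` method accepts such an argument or applies it this way is an assumption, and `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit and multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```

A penalty of 1.0 leaves the logits unchanged; values slightly above 1 discourage loops without forbidding a token from recurring.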
+ ---
+
+ ## What's Next
+
+ - **İvme-Conversate-22M-Instruct**: SFT on smol-smoltalk for instruction following
+ - **İvme-Conversate-v2**: extended training (~15B tokens), reordered curriculum
+ - **Turkish support**: v2 will add EN+TR with a dedicated bilingual tokenizer
+ - **İvme-Classify**: encoder-only series for classification tasks
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{ivme-conversate-22m,
+   author = {IvmeLabs},
+   title = {İvme-Conversate-22M-Base},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
+ }
+ ```
+
+ ---
+
+ *Built by IvmeLabs. Small models, deliberate choices.*