---
license: apache-2.0
language:
- en
tags:
- language-model
- transformer
- rope
- swiglu
- gqa
- muon
- from-scratch
- tiny
- small
- decoder-only
datasets:
- epfml/FineWeb-HQ
- HuggingFaceTB/cosmopedia
- HuggingFaceTB/finemath
- bigcode/python-stack-v1-functions-filtered
- wikimedia/wikipedia
pipeline_tag: text-generation
---
# İvme-Conversate-22M-Base
![Conversate-22M Logo](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/Gi8oMz-Q8n2CImbtVyHOy.png)
**İvme** (Turkish: *acceleration*) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M-parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.
The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision — architecture, data mix, optimizer, training schedule — is made deliberately.
---
## Model Details
| Parameter | Value |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 22,028,160 |
| Layers | 10 |
| Hidden dim | 384 |
| FFN dim | 1024 (SwiGLU) |
| Attention heads | 6 query / 2 KV (GQA) |
| Context length | 1024 tokens |
| Vocab size | 16,384 (custom BPE) |
| Positional encoding | RoPE (θ=10,000) |
| Normalization | RMSNorm (pre-norm) |
| Embeddings | Tied input/output |
| Biases | None |
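
The parameter count in the table can be reproduced from the other rows. The sketch below assumes a head dimension of 384 / 6 = 64, three bias-free SwiGLU projections (gate, up, down), two RMSNorm weight vectors per layer plus one final norm, and a tied output head that adds no parameters. The function name is illustrative and not part of `model.py`.

```python
def count_params(vocab=16_384, d_model=384, n_layers=10, d_ff=1024,
                 n_q_heads=6, n_kv_heads=2):
    """Count weights for a bias-free, tied-embedding GQA + SwiGLU decoder."""
    head_dim = d_model // n_q_heads           # 64
    kv_dim = n_kv_heads * head_dim            # 128: K/V heads are shared across query groups

    embed = vocab * d_model                   # tied with the output head, counted once
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, plus K and V
    ffn = 3 * d_model * d_ff                  # gate, up, down projections
    norms = 2 * d_model                       # pre-attention and pre-FFN RMSNorm

    # RoPE has no learned weights; the last term is the final RMSNorm.
    return embed + n_layers * (attn + ffn + norms) + d_model


print(f"{count_params():,}")  # 22,028,160
```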
---
## Benchmarks
All benchmarks were run 0-shot with [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). WikiText-2 is reported as `byte_perplexity`, which is normalized per byte rather than per token, so the score is comparable across tokenizers.
| Benchmark | Score | Notes |
|---|---|---|
| WikiText-2 (byte_perplexity) ↓ | **2.96** | Lower is better |
| BLiMP ↑ | **61.40%** | Average over 67 subtasks; random baseline 50% |
| ARC-Easy ↑ | **30.85%** | acc_norm, 0-shot; random baseline ≈25% |
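
Byte-level perplexity divides the total log-likelihood by the number of UTF-8 bytes instead of the number of tokens, which is what makes it tokenizer-independent. A minimal sketch of that arithmetic follows. The helper names and the example log-likelihood are made up for illustration and are not harness API.

```python
import math


def byte_perplexity(total_loglik_nats: float, n_bytes: int) -> float:
    """exp of the negative mean log-likelihood per UTF-8 byte."""
    return math.exp(-total_loglik_nats / n_bytes)


def bits_per_byte(total_loglik_nats: float, n_bytes: int) -> float:
    """The same quantity on a log2 scale."""
    return -total_loglik_nats / (n_bytes * math.log(2))


# Hypothetical document: 1,000 bytes scored at -1,085 nats in total.
ppl = byte_perplexity(-1085.0, 1000)
bpb = bits_per_byte(-1085.0, 1000)
print(round(ppl, 2), round(bpb, 3))

# The two metrics carry the same information: bits_per_byte == log2(byte_perplexity).
print(math.isclose(bpb, math.log2(ppl)))
```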
---
## Training
### Data Mix (~1.57B tokens, Chinchilla-optimal)
Data is ordered in ascending quality for curriculum learning — the model sees noisier web text first and the densest material last.
| Source | Tokens | Share |
|---|---|---|
| epfml/FineWeb-HQ (score > 0.8) | ~710M | 45% |
| bigcode/python-stack-v1-functions-filtered | ~160M | 10% |
| HuggingFaceTB/finemath (finemath-4plus) | ~235M | 15% |
| HuggingFaceTB/cosmopedia (stanford + wikihow) | ~395M | 25% |
| wikimedia/wikipedia (EN, 20231101) | ~80M | 5% |
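
As a quick consistency check, the shares in the table follow from the approximate token counts. The sketch below only redoes that arithmetic. The dictionary keys are shortened labels, and the counts are the rounded values from the table, so the total comes out near 1.58B rather than exactly 1.57B.

```python
# Approximate token counts per source, in millions, in table order.
mix_m_tokens = {
    "FineWeb-HQ": 710,
    "python-stack functions": 160,
    "finemath-4plus": 235,
    "cosmopedia": 395,
    "wikipedia-en": 80,
}

total = sum(mix_m_tokens.values())
# Share of each source as a whole-number percentage of the total.
shares = {name: round(100 * n / total) for name, n in mix_m_tokens.items()}

print(total)
print(shares)
```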
### Hyperparameters
| Setting | Value |
|---|---|
| Optimizer | Muon (body weights) + AdamW (embeddings, norms) |
| Muon lr | 0.02 |
| AdamW lr | 3e-4 |
| LR schedule | Warmup-Stable-Decay (WSD) |
| Warmup steps | 100 |
| Decay fraction | 20% of training |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Effective batch | ~1.05M tokens/step |
| Total steps | 1,447 |
| Precision | bfloat16 |
| Attention | Flash Attention 2 (HF Kernels) |
| Final weights | EMA (β=0.999) of training trajectory |
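
A Warmup-Stable-Decay schedule with these settings can be sketched as a multiplier applied to each optimizer's peak learning rate, alongside the EMA update used for the final weights. The table does not state the warmup shape, the decay shape, or the final learning rate, so linear warmup and linear decay towards zero are assumptions here, as are the function names. The EMA helper works on plain floats for clarity; a real implementation would apply the same rule to tensors.

```python
TOTAL_STEPS = 1_447
WARMUP_STEPS = 100
DECAY_FRACTION = 0.20


def wsd_multiplier(step: int) -> float:
    """LR multiplier in (0, 1] for a 0-indexed training step."""
    decay_steps = int(TOTAL_STEPS * DECAY_FRACTION)      # 289
    decay_start = TOTAL_STEPS - decay_steps              # 1158
    if step < WARMUP_STEPS:                              # linear warmup (assumed)
        return (step + 1) / WARMUP_STEPS
    if step < decay_start:                               # stable phase at peak LR
        return 1.0
    return max(0.0, (TOTAL_STEPS - step) / decay_steps)  # linear decay (assumed)


def ema_update(ema: dict, weights: dict, beta: float = 0.999) -> None:
    """In-place EMA of the weights: ema <- beta * ema + (1 - beta) * w."""
    for name, w in weights.items():
        ema[name] = beta * ema[name] + (1.0 - beta) * w


# Muon and AdamW share the schedule shape and differ only in peak LR.
for step in (0, 99, 600, 1_158, 1_446):
    m = wsd_multiplier(step)
    print(step, round(0.02 * m, 6), round(3e-4 * m, 8))
```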
### Hardware
Trained on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) in approximately **20 minutes**.
---
## Tokenizer
Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.
Special tokens: `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>`, `<|user|>`, `<|assistant|>`, `<|system|>`
---
## Usage
```python
import torch
from tokenizers import Tokenizer

# Custom code, not a standard HF AutoModel; model.py must be importable.
from model import IvmeConfig, IvmeConversate

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = Tokenizer.from_file("ivme_tokenizer.json")

# weights_only=False unpickles the stored config object; only load checkpoints you trust.
ckpt = torch.load("ivme_base_ema.pt", map_location=device, weights_only=False)
cfg = ckpt["cfg"]
cfg.attn_backend = "sdpa"  # or "kernels" for HF Kernels flash-attn

model = IvmeConversate(cfg).to(device)
model.load_state_dict(ckpt["model"])
model.eval()

prompt = "The theory of relativity states that"
ids = torch.tensor([tokenizer.encode(prompt).ids], device=device)
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
print(tokenizer.decode(out[0].tolist()))
```
---
## Limitations
- Base model only — not instruction tuned, will not follow instructions or answer questions
- English only (v1)
- Limited factual knowledge due to Chinchilla-optimal training (1.57B tokens)
- Repetition at higher temperatures without `repetition_penalty`
- 1024 token context window
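
The card does not say whether `model.generate` accepts a `repetition_penalty` argument. If it does not, the penalty can be applied to the logits before sampling. The sketch below uses the CTRL-style rule, which is also the convention in Hugging Face `transformers`: logits of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative. It works on a plain list of floats for clarity, and the function name and toy numbers are illustrative.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with previously generated token ids made less likely."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit and multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.4, -1.0, 0.6, 3.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0])
print(penalized)  # tokens 0 and 1 are pushed down, tokens 2 and 3 are unchanged
```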
---
## What's Next
- **İvme-Conversate-22M-Instruct** — SFT on smol-smoltalk for instruction following
- **İvme-Conversate-v2** — extended training (~15B tokens), reordered curriculum
- **Turkish support** — v2 will add EN+TR with a dedicated bilingual tokenizer
- **İvme-Classify** — encoder-only series for classification tasks
---
## Citation
```bibtex
@misc{ivme-conversate-22m,
author = {IvmeLabs},
title = {İvme-Conversate-22M-Base},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
}
```
---
*Built by IvmeLabs. Small models, deliberate choices.*