Update README.md

97dd11c verified 3 days ago

4.93 kB

	---
	license: apache-2.0
	language:
	- en
	tags:
	- language-model
	- transformer
	- rope
	- swiglu
	- gqa
	- muon
	- from-scratch
	- tiny
	- small
	- decoder-only
	datasets:
	- epfml/FineWeb-HQ
	- HuggingFaceTB/cosmopedia
	- HuggingFaceTB/finemath
	- bigcode/python-stack-v1-functions-filtered
	- wikimedia/wikipedia
	pipeline_tag: text-generation
	---

	# İvme-Conversate-22M-Base

	![Conversate-22M Logo](https://cdn-uploads.huggingface.co/production/uploads/670562d6ac129959c16f84d4/Gi8oMz-Q8n2CImbtVyHOy.png)

	İvme (Turkish: acceleration) is a series of stupidly small language models built to punch above their weight. This is the first release: a 22M parameter decoder-only base model trained from scratch on a dense, quality-filtered corpus.

	The goal is not production deployment. The goal is to see how well a sub-25M model can perform when every decision — architecture, data mix, optimizer, training schedule — is made deliberately.

	---

	## Model Details

	\| Parameter \| Value \|
	\|---\|---\|
	\| Architecture \| Decoder-only transformer \|
	\| Parameters \| 22,028,160 \|
	\| Layers \| 10 \|
	\| Hidden dim \| 384 \|
	\| FFN dim \| 1024 (SwiGLU) \|
	\| Attention heads \| 6 query / 2 KV (GQA) \|
	\| Context length \| 1024 tokens \|
	\| Vocab size \| 16,384 (custom BPE) \|
	\| Positional encoding \| RoPE (θ=10,000) \|
	\| Normalization \| RMSNorm (pre-norm) \|
	\| Embeddings \| Tied input/output \|
	\| Biases \| None \|

	---

	## Benchmarks

	All benchmarks run via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), 0-shot. WikiText-2 uses byte_perplexity for tokenizer-independent comparison.

	\| Benchmark \| Score \| Notes \|
	\|---\|---\|---\|
	\| WikiText-2 (byte_perplexity) ↓ \| 2.96 \| Lower is better \|
	\| BLiMP ↑ \| 61.40% \| Average over 67 subtasks; random baseline 50% \|
	\| ARC-Easy ↑ \| 30.85% \| acc_norm, 0-shot \|

	---

	## Training

	### Data Mix (~1.57B tokens, Chinchilla-optimal)

	Data is ordered in ascending quality for curriculum learning — the model sees noisier web text first and the densest material last.

	\| Source \| Tokens \| Share \|
	\|---\|---\|---\|
	\| epfml/FineWeb-HQ (score > 0.8) \| ~710M \| 45% \|
	\| bigcode/python-stack-v1-functions-filtered \| ~160M \| 10% \|
	\| HuggingFaceTB/finemath (finemath-4plus) \| ~235M \| 15% \|
	\| HuggingFaceTB/cosmopedia (stanford + wikihow) \| ~395M \| 25% \|
	\| wikimedia/wikipedia (EN, 20231101) \| ~80M \| 5% \|

	### Hyperparameters

	\| Setting \| Value \|
	\|---\|---\|
	\| Optimizer \| Muon (body weights) + AdamW (embeddings, norms) \|
	\| Muon lr \| 0.02 \|
	\| AdamW lr \| 3e-4 \|
	\| LR schedule \| Warmup-Stable-Decay (WSD) \|
	\| Warmup steps \| 100 \|
	\| Decay fraction \| 20% of training \|
	\| Weight decay \| 0.1 \|
	\| Gradient clipping \| 1.0 \|
	\| Effective batch \| ~1.05M tokens/step \|
	\| Total steps \| 1,447 \|
	\| Precision \| bfloat16 \|
	\| Attention \| Flash Attention 2 (HF Kernels) \|
	\| Final weights \| EMA (β=0.999) of training trajectory \|

	### Hardware

	Trained on a single NVIDIA RTX PRO 6000 Blackwell (96GB) in approximately 20 minutes.

	---

	## Tokenizer

	Custom BPE tokenizer trained from scratch on a balanced sample of the pretraining corpus. Vocab size 16,384 with ByteLevel pre-tokenization.

	Special tokens: `<\|pad\|>`, `<\|bos\|>`, `<\|eos\|>`, `<\|unk\|>`, `<\|user\|>`, `<\|assistant\|>`, `<\|system\|>`

	---

	## Usage

	```python
	import torch
	from tokenizers import Tokenizer

	# Load with custom code (not a standard HF AutoModel — see model.py)
	from model import IvmeConfig, IvmeConversate

	tokenizer = Tokenizer.from_file("ivme_tokenizer.json")
	ckpt = torch.load("ivme_base_ema.pt", map_location="cuda", weights_only=False)
	cfg = ckpt["cfg"]
	cfg.attn_backend = "sdpa" # or "kernels" for HF Kernels flash-attn
	model = IvmeConversate(cfg).cuda()
	model.load_state_dict(ckpt["model"])
	model.eval()

	prompt = "The theory of relativity states that"
	ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")
	out = model.generate(ids, max_new_tokens=100, temperature=0.8, top_k=40)
	print(tokenizer.decode(out[0].tolist()))
	```

	---

	## Limitations

	- Base model only — not instruction tuned, will not follow instructions or answer questions
	- English only (v1)
	- Limited factual knowledge due to Chinchilla-optimal training (1.57B tokens)
	- Repetition at higher temperatures without `repetition_penalty`
	- 1024 token context window

	---

	## What's Next

	- İvme-Conversate-22M-Instruct — SFT on smol-smoltalk for instruction following
	- İvme-Conversate-v2 — extended training (~15B tokens), reordered curriculum
	- Turkish support — v2 will add EN+TR with a dedicated bilingual tokenizer
	- İvme-Classify — encoder-only series for classification tasks

	---

	## Citation

	```bibtex
	@misc{ivme-conversate-22m,
	author = {IvmeLabs},
	title = {İvme-Conversate-22M-Base},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/IvmeLabs/Ivme-Conversate-22M-Base}
	}
	```

	---

	Built by IvmeLabs. Small models, deliberate choices.