	---
	language:
	- en
	license: apache-2.0
	tags:
	- keyboard
	- language-model
	- mobile
	- ios
	- coreml
	- english
	library_name: pytorch
	pipeline_tag: text-generation
	---

	# Onit Keyboard LM (EN-only, run8)

	A 40M-parameter English-only language model designed for next-word prediction
	in the Onit iOS keyboard. It replaces the previous bilingual run7 model, which
	surfaced French tokens (`même`, `soit`, `des`, `présent`) in EN keyboard
	contexts.

	## Architecture

	| Component           | Value                       |
	|---------------------|-----------------------------|
	| Type                | Causal LM (decoder-only)    |
	| Parameters          | ~40M                        |
	| Vocabulary          | 16,384 BPE tokens (EN-only) |
	| Embedding dim       | 512                         |
	| Layers              | 10                          |
	| Attention heads     | 8                           |
	| FFN dim             | 1408 (SwiGLU)               |
	| Max sequence length | 256                         |
	| Positional encoding | RoPE                        |
	| Normalization       | RMSNorm + QK-Norm           |
	| Embeddings          | Tied (input = output)       |
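
As a sanity check on the ~40M figure, the parameter count can be estimated from the table above. This is a rough sketch, not the model's actual definition: it assumes standard multi-head attention with four bias-free projections, a three-matrix SwiGLU FFN, two RMSNorm weight vectors per layer, QK-Norm weights of size head_dim, and one final norm. `estimate_params` is a hypothetical helper, and the real layer layout of run8 may differ.

```python
def estimate_params(vocab=16_384, d_model=512, n_layers=10, n_heads=8, d_ff=1408):
    """Estimate the parameter count from the architecture table (assumed layout)."""
    head_dim = d_model // n_heads

    # Tied embeddings: the input and output matrices share one weight.
    embedding = vocab * d_model

    # Assumed: Q, K, V and output projections, each d_model x d_model, no biases.
    attention = 4 * d_model * d_model

    # Assumed: SwiGLU uses gate, up and down projections.
    ffn = 3 * d_model * d_ff

    # Assumed: two RMSNorm vectors per layer, plus QK-Norm weights for Q and K.
    norms = 2 * d_model + 2 * head_dim

    per_layer = attention + ffn + norms
    final_norm = d_model
    return embedding + n_layers * per_layer + final_norm


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands close to the reported ~40M, with roughly a fifth of the weights in the tied embedding matrix.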

	## Training

	EN-only corpus (43M lines, ~445M tokens):
	- `clean_en` (Tim's curated corpus)
	- `opensubtitles_en` (filtered for mislabeled French; deduplicated)

	Training run (run8):
	- 30,000 steps, learning rate 6e-5 with a cosine schedule, 1,000 warmup steps, effective batch size 64
	- Validation PPL: 38.08 on the held-out EN val split
	- Test PPL: 37.88 on the held-out EN test split (no contamination)
	- 100% argmax parity between PyTorch and the exported CoreML model
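
For reference, the learning-rate schedule listed above could look like the following. This is a minimal sketch under assumptions the card does not state: 6e-5 is treated as the peak rate, warmup is linear, and the cosine decay ends at zero on step 30,000. `lr_at` is a hypothetical helper, not code from the training run.

```python
import math


def lr_at(step, peak_lr=6e-5, warmup=1000, total=30_000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup:
        # Assumed linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 15_500, 30_000):
    print(step, f"{lr_at(step):.3e}")
```

If the actual run used a non-zero floor, passing `min_lr` changes only the tail of the curve.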

	## Files

	| File                                | Description                                    |
	|-------------------------------------|------------------------------------------------|
	| `keyboard_lm_seq128_fp16.mlpackage` | CoreML mlprogram fp16, seq_len=128 (iOS)       |
	| `tokenizer_en.json`                 | BPE 16K tokenizer trained on the EN-only corpus |
	| `config.json`                       | Model configuration                            |

	## iOS usage notes

	- Strip trailing whitespace from prompts before tokenization. The model
	was trained on clean sentences and produces noisy subword fragments on
	inputs like `"Hey guys "` (with trailing space). Use
	`prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding.
	- 100% argmax agreement between PyTorch and the exported CoreML model on
	validation prompts. Top-1 predictions on iOS match the PyTorch reference;
	the raw outputs differ only by fp16 quantization noise (max abs diff ≈ 0.011).
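
The parity check described above can be reproduced with a small comparison like the one below. This is a sketch only: it assumes both runtimes produce one row of logits per prompt position and that the reported max abs diff is measured on those logits. `parity_report` and the sample values are hypothetical illustrations, not measurements from the model.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def parity_report(ref_logits, test_logits):
    """Return (argmax agreement rate, max absolute logit difference)."""
    if not ref_logits or len(ref_logits) != len(test_logits):
        raise ValueError("expected two non-empty logit lists of equal length")
    matches = 0
    max_abs_diff = 0.0
    for ref_row, test_row in zip(ref_logits, test_logits):
        # Top-1 agreement: does each runtime pick the same token?
        if argmax(ref_row) == argmax(test_row):
            matches += 1
        # Track the largest element-wise deviation across all rows.
        for r, t in zip(ref_row, test_row):
            max_abs_diff = max(max_abs_diff, abs(r - t))
    return matches / len(ref_logits), max_abs_diff


# Illustrative values only: fp32 reference vs. a slightly perturbed fp16 export.
pytorch_logits = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
coreml_logits = [[0.105, 1.992, -1.004], [2.995, 0.51, 0.2]]
agreement, diff = parity_report(pytorch_logits, coreml_logits)
print(f"argmax agreement: {agreement:.0%}, max abs diff: {diff:.3f}")
```

Small logit differences leave the argmax unchanged unless two candidates are nearly tied, which is why the agreement rate and the max abs diff are reported separately.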

	## License

	Apache 2.0