Onit Keyboard LM (EN-only, run8)

A ~40M-parameter, English-only language model designed for next-word prediction in the Onit iOS keyboard. It replaces the previous bilingual run7 model, which surfaced French tokens (même, soit, des, présent) in EN keyboard contexts.

Architecture

Component	Value
Type	Causal LM (decoder-only)
Parameters	~40M
Vocabulary	16,384 BPE tokens (EN-only)
Embedding dim	512
Layers	10
Attention heads	8
FFN dim	1408 (SwiGLU)
Max sequence length	256
Positional encoding	RoPE
Normalization	RMSNorm + QK-Norm
Embeddings	Tied (input = output)
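
As a sanity check on the ~40M figure, the table values can be turned into a rough parameter count. This is only a sketch under assumptions the card does not state: no bias terms, standard multi-head attention with full-width Q/K/V/output projections, and the small RMSNorm and QK-Norm gain vectors ignored.

```swift
// Rough parameter count from the architecture table.
// Assumptions (not stated in the model card): no bias terms, standard
// multi-head attention with full-width Q/K/V/O projections, and the small
// RMSNorm / QK-Norm gain vectors are ignored.
let vocabSize = 16_384
let embeddingDim = 512
let layerCount = 10
let ffnDim = 1_408

// Tied embeddings: the input table doubles as the output projection,
// so it is counted once.
let embeddingParams = vocabSize * embeddingDim

// Q, K, V and output projections, each embeddingDim x embeddingDim.
let attentionParams = 4 * embeddingDim * embeddingDim

// SwiGLU uses three matrices per layer: gate, up, and down.
let ffnParams = 3 * embeddingDim * ffnDim

let totalParams = embeddingParams + layerCount * (attentionParams + ffnParams)
print(totalParams)  // 40501248
```

Under these assumptions the estimate comes to about 40.5M, consistent with the stated ~40M; the tied embedding table accounts for roughly 8.4M of that.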

Training

EN-only corpus (43M lines, ~445M tokens):

clean_en (Tim's curated corpus)
opensubtitles_en (filtered for mislabeled French; deduplicated)

Training run (run8):

30,000 steps, learning rate 6e-5 with a cosine schedule, 1,000 warmup steps, effective batch size 64
Validation PPL: 38.08 on the held-out EN val split
Test PPL: 37.88 on the held-out EN test split (no contamination)
100 % argmax parity between PyTorch and the exported CoreML model
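
To make the argmax parity figure concrete, here is a minimal sketch of how such a check could be computed from two sets of per-prompt logits. The names `argmax`, `ParityReport`, and `compareLogits` are hypothetical, the arrays are toy values rather than real model output, and the actual evaluation script for run8 may differ. The sketch also tracks the largest absolute difference, a common companion metric.

```swift
/// Index of the largest value, or nil for an empty array.
func argmax(_ values: [Float]) -> Int? {
    guard var best = values.indices.first else { return nil }
    for i in values.indices where values[i] > values[best] { best = i }
    return best
}

/// Hypothetical summary type for a parity check.
struct ParityReport {
    let argmaxAgreement: Double  // fraction of prompts with the same top-1 token
    let maxAbsDiff: Float        // largest elementwise difference seen
}

/// Compares reference logits against exported-model logits, one row per prompt.
func compareLogits(reference: [[Float]], exported: [[Float]]) -> ParityReport {
    precondition(reference.count == exported.count && !reference.isEmpty)
    var matches = 0
    var maxDiff: Float = 0
    for (ref, out) in zip(reference, exported) {
        precondition(ref.count == out.count)
        if argmax(ref) == argmax(out) { matches += 1 }
        for (a, b) in zip(ref, out) { maxDiff = max(maxDiff, abs(a - b)) }
    }
    return ParityReport(
        argmaxAgreement: Double(matches) / Double(reference.count),
        maxAbsDiff: maxDiff
    )
}

// Toy data (assumed for illustration): small perturbations that keep the top-1 token.
let reference: [[Float]] = [[0.10, 2.50, -1.00], [3.00, 0.20, 0.10]]
let exported: [[Float]] = [[0.11, 2.49, -1.01], [2.99, 0.21, 0.10]]
let report = compareLogits(reference: reference, exported: exported)
print(report.argmaxAgreement)  // 1.0
```

An agreement of 1.0 corresponds to the 100 % figure: the top-1 token is identical for every prompt even though the underlying values differ slightly.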

Files

File	Description
`keyboard_lm_seq128_fp16.mlpackage`	CoreML mlprogram fp16, seq_len=128 (iOS)
`tokenizer_en.json`	BPE 16K tokenizer trained on the EN-only corpus
`config.json`	Model configuration

iOS usage notes

Strip trailing whitespace from prompts before tokenization. The model was trained on clean sentences and produces noisy subword fragments on inputs such as "Hey guys " (with a trailing space). Call `prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding; note that this also strips leading whitespace.
100 % argmax agreement between PyTorch and the exported CoreML model on validation prompts. Predictions on iOS match the PyTorch reference up to fp16 quantization noise (max abs diff ≈ 0.011), so the outputs are not bit-identical, but the top-1 token is the same.
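
The whitespace note above can be wrapped in a small helper so that every prompt passes through the same normalization before encoding. The function name `normalizedPrompt` is hypothetical; only `trimmingCharacters(in:)` comes from the note itself.

```swift
import Foundation

/// Removes whitespace and newlines from both ends of the prompt.
/// Only trailing whitespace is required by the usage note; leading
/// whitespace is stripped as a side effect of trimmingCharacters(in:).
func normalizedPrompt(_ raw: String) -> String {
    raw.trimmingCharacters(in: .whitespacesAndNewlines)
}

let prompt = normalizedPrompt("Hey guys ")
print(prompt)  // Hey guys
```

Interior spaces are left untouched, so "Hey guys " becomes "Hey guys" with the word boundary preserved.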

License

Apache 2.0
