Onit Keyboard LM (EN-only, run8)
A 40M-parameter English-only language model for next-word prediction
in the Onit iOS keyboard. It replaces the previous bilingual run7 model, which
surfaced French tokens (même, soit, des, présent) in English keyboard
contexts.
Architecture
| Component | Value |
|---|---|
| Type | Causal LM (decoder-only) |
| Parameters | ~40M |
| Vocabulary | 16,384 BPE tokens (EN-only) |
| Embedding dim | 512 |
| Layers | 10 |
| Attention heads | 8 |
| FFN dim | 1408 (SwiGLU) |
| Max sequence length | 256 |
| Positional encoding | RoPE |
| Normalization | RMSNorm + QK-Norm |
| Embeddings | Tied (input = output) |
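As a sanity check on the ~40M figure, the parameter count can be estimated from the table values. This is a rough sketch, not the model's actual code: it assumes bias-free linear layers, four d×d attention projections per layer, three SwiGLU matrices per layer, two RMSNorm weight vectors per layer plus a final one, and QK-Norm weights of size head_dim shared across heads. None of those details are stated above.

```python
# Values taken from the architecture table.
VOCAB_SIZE = 16_384
EMBED_DIM = 512
NUM_LAYERS = 10
NUM_HEADS = 8
FFN_DIM = 1_408


def estimate_parameters() -> int:
    """Estimate the parameter count under the assumptions named in the text."""
    head_dim = EMBED_DIM // NUM_HEADS

    # Tied embeddings: the input and output matrices share one set of weights.
    embeddings = VOCAB_SIZE * EMBED_DIM

    # Assumption: Q, K, V and output projections, each d x d, no biases.
    attention = 4 * EMBED_DIM * EMBED_DIM

    # SwiGLU uses three matrices: gate, up and down projections.
    ffn = 3 * EMBED_DIM * FFN_DIM

    # Assumption: two RMSNorm weight vectors per layer (pre-attention, pre-FFN)
    # and one QK-Norm weight vector each for queries and keys.
    norms = 2 * EMBED_DIM + 2 * head_dim

    per_layer = attention + ffn + norms
    final_norm = EMBED_DIM
    return embeddings + NUM_LAYERS * per_layer + final_norm


total = estimate_parameters()
print(f"Estimated parameters: {total:,} (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to about 40.5M, consistent with the ~40M in the table. Roughly 8.4M of that is the tied embedding matrix.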
Training
EN-only corpus (43M lines, ~445M tokens):
- clean_en (Tim's curated corpus)
- opensubtitles_en (filtered for mislabeled French; deduplicated)
Training run (run8):
- 30,000 steps, lr 6e-5 cosine schedule, warmup 1000, effective batch 64
- Validation PPL: 38.08 on the held-out EN val split
- Test PPL: 37.88 on the held-out EN test split (no contamination)
- 100 % argmax parity between PyTorch and the exported CoreML model
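The learning-rate schedule listed above (linear warmup for 1,000 steps, then cosine decay over the remaining steps) can be sketched as follows. The peak rate, warmup length and step count come from the list; the linear warmup shape and the floor of zero (`MIN_LR`) are assumptions, since the training script is not shown here, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 6e-5        # from the training summary
WARMUP_STEPS = 1_000  # from the training summary
TOTAL_STEPS = 30_000  # from the training summary
MIN_LR = 0.0          # assumption: the real run may decay to a non-zero floor


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    # Cosine decay from PEAK_LR down to MIN_LR.
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 15_500, 30_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With these settings the rate peaks at step 1,000 and is at half the peak midway through the decay phase (step 15,500).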
Files
| File | Description |
|---|---|
| keyboard_lm_seq128_fp16.mlpackage | CoreML mlprogram fp16, seq_len=128 (iOS) |
| tokenizer_en.json | BPE 16K tokenizer trained on the EN-only corpus |
| config.json | Model configuration |
iOS usage notes
- Strip trailing whitespace from prompts before tokenization. The model
was trained on clean sentences and produces noisy subword fragments on
inputs like `"Hey guys "` (with a trailing space). Use
`prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding.
- 100 % argmax agreement between PyTorch and the exported CoreML model on
validation prompts. Top-1 predictions on iOS match the PyTorch reference; the
logits themselves differ only by fp16 quantization noise (max abs diff ≈ 0.011).
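The two parity figures quoted above (argmax agreement and maximum absolute logit difference) can be computed along these lines. This is a minimal sketch on toy data, not the project's actual validation script: `parity_report` is a hypothetical helper, and the logit rows stand in for per-position outputs from the PyTorch reference and the exported CoreML model.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def parity_report(reference_logits, exported_logits):
    """Return (argmax agreement rate, max absolute logit difference).

    Each argument is a list of logit rows, one row per evaluated position.
    """
    matches = 0
    max_abs_diff = 0.0
    for ref_row, exp_row in zip(reference_logits, exported_logits):
        # Top-1 parity: both models pick the same token.
        if argmax(ref_row) == argmax(exp_row):
            matches += 1
        # Track the worst elementwise deviation across all rows.
        row_diff = max(abs(r - e) for r, e in zip(ref_row, exp_row))
        max_abs_diff = max(max_abs_diff, row_diff)
    return matches / len(reference_logits), max_abs_diff


# Toy logits for illustration only; these are not real model outputs.
reference = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]
exported = [[0.105, 1.99, -1.0], [1.49, 0.21, 0.3]]

agreement, max_diff = parity_report(reference, exported)
print(f"argmax agreement: {agreement:.0%}, max abs diff: {max_diff:.3f}")
```

Small logit differences leave the argmax unchanged as long as the gap between the top two logits exceeds the noise, which is why full top-1 agreement and a non-zero max abs diff can both hold.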
License
Apache 2.0