---
language:
- en
license: apache-2.0
tags:
- keyboard
- language-model
- mobile
- ios
- coreml
- english
library_name: pytorch
pipeline_tag: text-generation
---
# Onit Keyboard LM (EN-only, run8)
A 40M-parameter, English-only language model for next-word prediction in the
Onit iOS keyboard. It replaces the previous bilingual run7 model, which
surfaced French tokens (`même`, `soit`, `des`, `présent`) in EN keyboard
contexts.
## Architecture
| Component | Value |
|-----------------------|-----------------------------|
| Type | Causal LM (decoder-only) |
| Parameters | ~40M |
| Vocabulary | 16,384 BPE tokens (EN-only) |
| Embedding dim | 512 |
| Layers | 10 |
| Attention heads | 8 |
| FFN dim | 1408 (SwiGLU) |
| Max sequence length | 256 |
| Positional encoding | RoPE |
| Normalization | RMSNorm + QK-Norm |
| Embeddings | Tied (input = output) |
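
As a sanity check on the "~40M" figure, the table values can be plugged into a rough parameter count. This is only a sketch under assumptions the card does not state: no bias terms, standard multi-head attention with full-width Q/K/V/O projections, two RMSNorm layers per block, QK-Norm weights of size `head_dim`, and one final RMSNorm. The function name `estimate_params` is hypothetical.

```python
# Rough parameter count from the architecture table.
# Assumptions (not stated in the card): no biases, full multi-head attention,
# two RMSNorms per layer, QK-Norm weights of size head_dim, one final RMSNorm.

VOCAB = 16_384
D_MODEL = 512
N_LAYERS = 10
N_HEADS = 8
FFN_DIM = 1_408


def estimate_params() -> int:
    head_dim = D_MODEL // N_HEADS
    embedding = VOCAB * D_MODEL           # tied input/output: counted once
    attention = 4 * D_MODEL * D_MODEL     # Q, K, V and output projections
    ffn = 3 * D_MODEL * FFN_DIM           # SwiGLU: gate, up and down matrices
    norms = 2 * D_MODEL + 2 * head_dim    # two RMSNorms plus Q and K norm weights
    per_layer = attention + ffn + norms
    return embedding + N_LAYERS * per_layer + D_MODEL  # + final RMSNorm


if __name__ == "__main__":
    total = estimate_params()
    print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands at roughly 40.5M, with about 8.4M of that in the tied embedding matrix. The real count may differ slightly if the implementation uses biases or a different norm layout.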
## Training
EN-only corpus (43M lines, ~445M tokens):
- `clean_en` (Tim's curated corpus)
- `opensubtitles_en` (filtered for mislabeled French; deduplicated)
Training run (run8):
- 30,000 steps, learning rate 6e-5 with a cosine schedule, 1,000 warmup steps, effective batch size 64
- Validation PPL: **38.08** on the held-out EN val split
- Test PPL: **37.88** on the held-out EN test split (no contamination)
- 100% argmax parity between PyTorch and the exported CoreML model
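
The schedule and perplexity figures above can be illustrated with a short sketch. It assumes that 6e-5 is the peak learning rate, that warmup is linear from zero, and that the cosine decays to a floor of zero; none of these details is stated in the card. The names `lr_at` and `ppl_to_loss` are hypothetical.

```python
import math

PEAK_LR = 6e-5        # assumption: the card's "lr 6e-5" is the peak value
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000
MIN_LR = 0.0          # assumption: the card does not state a floor


def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def ppl_to_loss(ppl: float) -> float:
    """Perplexity is exp(mean cross-entropy), so the loss in nats is log(ppl)."""
    return math.log(ppl)


if __name__ == "__main__":
    for step in (0, 500, 1_000, 15_500, 30_000):
        print(step, lr_at(step))
    print("val loss (nats):", ppl_to_loss(38.08))
    print("test loss (nats):", ppl_to_loss(37.88))
```

Read this way, the reported perplexities correspond to a mean cross-entropy of roughly 3.64 nats for validation and 3.63 nats for test.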
## Files
| File | Description |
|---------------------------------------|------------------------------------------------|
| `keyboard_lm_seq128_fp16.mlpackage` | CoreML mlprogram fp16, seq_len=128 (iOS) |
| `tokenizer_en.json` | BPE 16K tokenizer trained on the EN-only corpus |
| `config.json` | Model configuration |
## iOS usage notes
- **Strip trailing whitespace from prompts before tokenization.** The model
was trained on clean sentences and produces noisy subword fragments on
inputs like `"Hey guys "` (with a trailing space). Call
`prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding;
note that this also removes leading whitespace.
- 100% argmax agreement between PyTorch and the exported CoreML model on
validation prompts. Top-1 predictions on iOS match the PyTorch reference;
the underlying logits differ only by fp16 noise (max abs diff ≈ 0.011).
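
For checking both notes off-device, a minimal Python sketch follows. `normalize_prompt` roughly mirrors the Swift trimming call using `str.strip()`, and `compare_logits` computes the argmax agreement rate and maximum absolute difference between two sets of logit vectors. Both helper names are hypothetical, and the logits in the example are made-up toy values, not outputs of this model.

```python
from typing import Sequence, Tuple


def normalize_prompt(prompt: str) -> str:
    """Trim leading/trailing whitespace and newlines before tokenization."""
    return prompt.strip()


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(
    reference: Sequence[Sequence[float]],
    exported: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (argmax agreement rate, max absolute logit difference)."""
    if len(reference) != len(exported) or not reference:
        raise ValueError("need two non-empty, equally sized batches")
    matches = sum(argmax(r) == argmax(e) for r, e in zip(reference, exported))
    max_diff = max(
        abs(a - b) for r, e in zip(reference, exported) for a, b in zip(r, e)
    )
    return matches / len(reference), max_diff


if __name__ == "__main__":
    print(repr(normalize_prompt("Hey guys ")))
    # Toy logits only: stand-ins for PyTorch (fp32) and CoreML (fp16) outputs.
    ref = [[0.10, 2.50, -1.00], [3.20, 0.40, 0.39]]
    exp = [[0.11, 2.49, -1.01], [3.19, 0.41, 0.40]]
    print(compare_logits(ref, exp))
```

An agreement rate of 1.0 alongside a small nonzero maximum difference is the pattern the card reports: identical top-1 predictions with fp16-level deviations in the raw logits.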
## License
Apache 2.0