---
language:
- en
license: apache-2.0
tags:
- keyboard
- language-model
- mobile
- ios
- coreml
- english
library_name: pytorch
pipeline_tag: text-generation
---

# Onit Keyboard LM (EN-only, run8)

A 40M-parameter, English-only language model designed for next-word prediction
in the Onit iOS keyboard. It replaces the previous bilingual run7 model, which
surfaced French tokens (`même`, `soit`, `des`, `présent`) in EN keyboard
contexts.

## Architecture

| Component           | Value                       |
|---------------------|-----------------------------|
| Type                | Causal LM (decoder-only)    |
| Parameters          | ~40M                        |
| Vocabulary          | 16,384 BPE tokens (EN-only) |
| Embedding dim       | 512                         |
| Layers              | 10                          |
| Attention heads     | 8                           |
| FFN dim             | 1408 (SwiGLU)               |
| Max sequence length | 256                         |
| Positional encoding | RoPE                        |
| Normalization       | RMSNorm + QK-Norm           |
| Embeddings          | Tied (input = output)       |
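
As a rough sanity check on the ~40M figure, the parameter count can be estimated from the table. The sketch below assumes bias-free linear layers, four full `d × d` attention projections per layer (Q, K, V, output), and three SwiGLU matrices per layer, and it ignores the small RMSNorm and QK-Norm weights. None of these details are stated in this model card, so treat the result as an estimate only.

```python
# Rough parameter estimate from the architecture table.
# Assumptions (not stated in the model card): no biases, full d x d
# Q/K/V/output projections, three SwiGLU matrices, norm weights ignored.
VOCAB = 16_384
D_MODEL = 512
N_LAYERS = 10
D_FFN = 1_408

embedding = VOCAB * D_MODEL        # counted once: input and output are tied
attention = 4 * D_MODEL * D_MODEL  # Q, K, V and output projections
ffn = 3 * D_MODEL * D_FFN          # SwiGLU: gate, up and down projections
per_layer = attention + ffn
total = embedding + N_LAYERS * per_layer

print(f"embedding: {embedding / 1e6:.1f}M")
print(f"per layer: {per_layer / 1e6:.2f}M")
print(f"total:     {total / 1e6:.1f}M")
```

Under these assumptions the estimate comes to about 40.5M parameters, of which roughly 8.4M sit in the tied embedding matrix.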

## Training

EN-only corpus (43M lines, ~445M tokens):
- `clean_en` (Tim's curated corpus)
- `opensubtitles_en` (filtered for mislabeled French; deduplicated)

Training run (run8):
- 30,000 steps; learning rate 6e-5 with a cosine schedule and 1,000 warmup steps; effective batch size 64
- Validation PPL: **38.08** on the held-out EN val split
- Test PPL: **37.88** on the held-out EN test split (no contamination)
- 100% argmax parity between PyTorch and the exported CoreML model
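
The run settings above can be sketched in a few lines of Python. The exact schedule shape is not given here, so this sketch assumes a linear warmup followed by cosine decay to a hypothetical `min_lr` of zero. The `perplexity` helper uses the standard definition (the exponential of the mean per-token negative log-likelihood), which may differ in detail from the evaluation script actually used.

```python
import math

PEAK_LR = 6e-5
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Learning rate at a few points along the run.
for step in (0, 999, 1_000, 15_500, 30_000):
    print(step, f"{lr_at(step):.2e}")

# Mean loss (nats per token) that corresponds to a perplexity of 38.08.
print(f"{math.log(38.08):.3f}")
```

Read in the other direction, a validation perplexity of 38.08 corresponds to a mean loss of about 3.64 nats per token.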

## Files

| File                                | Description                                     |
|-------------------------------------|-------------------------------------------------|
| `keyboard_lm_seq128_fp16.mlpackage` | CoreML mlprogram, fp16, seq_len=128 (iOS)       |
| `tokenizer_en.json`                 | BPE 16K tokenizer trained on the EN-only corpus |
| `config.json`                       | Model configuration                             |

## iOS usage notes

- **Strip trailing whitespace from prompts before tokenization.** The model
  was trained on clean sentences and produces noisy subword fragments on
  inputs such as `"Hey guys "` (note the trailing space). Call
  `prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding;
  this trims whitespace and newlines from both ends of the prompt.
- **PyTorch/CoreML parity.** The exported CoreML model agrees with PyTorch on
  100% of argmax predictions for the validation prompts. The raw outputs are
  not bit-identical: fp16 quantization introduces a max abs diff of ≈ 0.011,
  which does not change the top-1 prediction on those prompts.
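
On the PyTorch reference side, the same two points can be mirrored in plain Python. This is a minimal sketch, not the project's actual validation script: `clean_prompt` and `parity_report` are hypothetical helper names, `str.strip()` only approximates the Swift call, and the toy values stand in for real per-prompt model outputs.

```python
def clean_prompt(prompt: str) -> str:
    """Trim whitespace from both ends; approximates the Swift trimming call."""
    return prompt.strip()


def argmax(values: list[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def parity_report(
    reference: list[list[float]], exported: list[list[float]]
) -> tuple[float, float]:
    """Return (argmax agreement fraction, max abs difference) over paired outputs."""
    agree = sum(argmax(r) == argmax(e) for r, e in zip(reference, exported))
    max_diff = max(
        abs(a - b) for r, e in zip(reference, exported) for a, b in zip(r, e)
    )
    return agree / len(reference), max_diff


# Toy data standing in for per-prompt next-token outputs (made-up values).
reference = [[0.10, 2.50, -1.00], [3.00, 0.20, 0.10]]
exported = [[0.11, 2.49, -1.00], [2.99, 0.21, 0.10]]

agreement, max_diff = parity_report(reference, exported)
print(repr(clean_prompt("Hey guys ")), agreement, max_diff)
```

Reporting the agreement fraction and the max abs difference separately shows how both claims can hold at once: small numeric drift in the outputs, with the top-1 choice unchanged.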

## License

Apache 2.0