---
language:
- en
license: apache-2.0
tags:
- keyboard
- language-model
- mobile
- ios
- coreml
- english
library_name: pytorch
pipeline_tag: text-generation
---

# Onit Keyboard LM (EN-only, run8)

A 40M parameter English-only language model designed for next-word prediction in the Onit iOS keyboard. Replaces the previous bilingual run7 model that surfaced French tokens (`même`, `soit`, `des`, `présent`) in EN keyboard contexts.

## Architecture

| Component           | Value                       |
|---------------------|-----------------------------|
| Type                | Causal LM (decoder-only)    |
| Parameters          | ~40M                        |
| Vocabulary          | 16,384 BPE tokens (EN-only) |
| Embedding dim       | 512                         |
| Layers              | 10                          |
| Attention heads     | 8                           |
| FFN dim             | 1408 (SwiGLU)               |
| Max sequence length | 256                         |
| Positional encoding | RoPE                        |
| Normalization       | RMSNorm + QK-Norm           |
| Embeddings          | Tied (input = output)       |

## Training

EN-only corpus (43M lines, ~445M tokens):

- `clean_en` (Tim's curated corpus)
- `opensubtitles_en` (filtered for mislabeled French; deduplicated)

Training run (run8):

- 30,000 steps, lr 6e-5 cosine schedule, warmup 1000, effective batch 64
- Validation PPL: **38.08** on the held-out EN val split
- Test PPL: **37.88** on the held-out EN test split (no contamination)
- 100% argmax parity between PyTorch and the exported CoreML model

## Files

| File                                | Description                                     |
|-------------------------------------|-------------------------------------------------|
| `keyboard_lm_seq128_fp16.mlpackage` | CoreML mlprogram fp16, seq_len=128 (iOS)        |
| `tokenizer_en.json`                 | BPE 16K tokenizer trained on the EN-only corpus |
| `config.json`                       | Model configuration                             |

## iOS usage notes

- **Strip trailing whitespace from prompts before tokenization.** The model was trained on clean sentences and produces noisy subword fragments on inputs like `"Hey guys "` (with trailing space). Use `prompt.trimmingCharacters(in: .whitespacesAndNewlines)` before encoding.
- 100% argmax agreement between PyTorch and the exported CoreML model on validation prompts. The top-1 prediction on iOS matches the PyTorch reference on every one of those prompts; the logits themselves are not bit-identical because of fp16 quantization noise (max abs diff ≈ 0.011).

## License

Apache 2.0
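
## Checking argmax parity (sketch)

The parity figures quoted in this card (100% argmax agreement, max abs diff ≈ 0.011) combine two different measurements: whether both runtimes pick the same top token, and how far apart the raw logits are. The following is a minimal sketch of how such a comparison could be computed. It is not the evaluation script used for run8; the helper names `argmax` and `parity_report` are hypothetical, and the logit values are made-up toy data standing in for real PyTorch and CoreML outputs.

```python
def argmax(values):
    """Return the index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def parity_report(reference_logits, exported_logits):
    """Compare two lists of per-prompt logit vectors.

    Returns a tuple: (argmax agreement rate, maximum absolute logit difference).
    """
    if not reference_logits or len(reference_logits) != len(exported_logits):
        raise ValueError("both runs must cover the same non-empty set of prompts")
    matches = 0
    max_abs_diff = 0.0
    for ref, exp in zip(reference_logits, exported_logits):
        if len(ref) != len(exp):
            raise ValueError("logit vectors must have the same vocabulary size")
        # Top-1 agreement: do both runtimes predict the same next token?
        matches += argmax(ref) == argmax(exp)
        # Numerical drift: largest element-wise gap seen so far.
        max_abs_diff = max(max_abs_diff, max(abs(r - e) for r, e in zip(ref, exp)))
    return matches / len(reference_logits), max_abs_diff


# Toy data (hypothetical): two prompts, vocabulary of three tokens.
pytorch_logits = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
coreml_logits = [[1.992, 0.508, -1.004], [0.105, 2.99, 0.21]]

agreement, diff = parity_report(pytorch_logits, coreml_logits)
print(f"argmax agreement: {agreement:.0%}, max abs diff: {diff:.3f}")
```

In this toy case the logits differ slightly on every entry, yet the top token is the same for both prompts, which is the situation the usage note describes: full agreement on predictions despite small fp16 differences in the underlying scores.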