Onit Keyboard LM (EN-only, run8)

A 40M-parameter English-only language model designed for next-word prediction in the Onit iOS keyboard. It replaces the previous bilingual run7 model, which surfaced French tokens (même, soit, des, présent) in EN keyboard contexts.

Architecture

| Component | Value |
|---|---|
| Type | Causal LM (decoder-only) |
| Parameters | ~40M |
| Vocabulary | 16,384 BPE tokens (EN-only) |
| Embedding dim | 512 |
| Layers | 10 |
| Attention heads | 8 |
| FFN dim | 1408 (SwiGLU) |
| Max sequence length | 256 |
| Positional encoding | RoPE |
| Normalization | RMSNorm + QK-Norm |
| Embeddings | Tied (input = output) |
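
As a sanity check on the ~40M figure, the sketch below estimates the parameter count from the values in the table. It is only an estimate under assumptions the card does not state: no bias terms, four square attention projections (Q, K, V, O), three SwiGLU matrices per layer, QK-Norm weights sized to the per-head dimension, and one final RMSNorm. The function name `estimate_params` is illustrative.

```python
# Rough parameter count from the architecture table.
# Assumptions (not stated in the card): no bias terms, four square attention
# projections (Q, K, V, O), three SwiGLU matrices, QK-Norm weights sized to
# the per-head dimension, and a single final RMSNorm.
VOCAB, D_MODEL, N_LAYERS, N_HEADS, D_FFN = 16_384, 512, 10, 8, 1408


def estimate_params() -> dict:
    head_dim = D_MODEL // N_HEADS              # 64
    embedding = VOCAB * D_MODEL                # counted once: input and output are tied
    attention = 4 * D_MODEL * D_MODEL          # Q, K, V, O projections
    ffn = 3 * D_MODEL * D_FFN                  # gate, up, down (SwiGLU)
    norms = 2 * D_MODEL + 2 * head_dim         # two RMSNorms + Q and K norm weights
    per_layer = attention + ffn + norms
    total = embedding + N_LAYERS * per_layer + D_MODEL  # + final RMSNorm
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


counts = estimate_params()
print(f"{counts['total'] / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to about 40.5M, with roughly 8.4M in the tied embedding matrix and the rest in the ten transformer layers, which is consistent with the stated ~40M.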

Training

EN-only corpus (43M lines, ~445M tokens):

  • clean_en (Tim's curated corpus)
  • opensubtitles_en (filtered for mislabeled French; deduplicated)

Training run (run8):

  • 30,000 steps, learning rate 6e-5 with a cosine schedule, 1,000 warmup steps, effective batch size 64
  • Validation PPL: 38.08 on the held-out EN val split
  • Test PPL: 37.88 on the held-out EN test split (no contamination)
  • 100% argmax parity between PyTorch and the exported CoreML model
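
For readers who want to reproduce the learning-rate curve, here is a minimal sketch of a cosine schedule with warmup using the run8 numbers. The card does not say how warmup is shaped, whether 6e-5 is the peak rate, or whether there is a learning-rate floor. The sketch assumes linear warmup from zero, 6e-5 as the peak, and decay to zero; `lr_at` and the constant names are illustrative, not the training code.

```python
import math

PEAK_LR = 6e-5        # assumption: the stated lr is the peak rate
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000
MIN_LR = 0.0          # assumption: the card does not state a learning-rate floor


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak rate (assumed shape).
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to MIN_LR over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 15_500, 30_000):
    print(step, f"{lr_at(step):.2e}")
```

With these assumptions the rate reaches 6e-5 at step 1,000, falls to half of that midway through the decay phase (step 15,500), and ends at zero at step 30,000.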

Files

| File | Description |
|---|---|
| keyboard_lm_seq128_fp16.mlpackage | CoreML mlprogram fp16, seq_len=128 (iOS) |
| tokenizer_en.json | BPE 16K tokenizer trained on the EN-only corpus |
| config.json | Model configuration |
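
Because the exported package is built for seq_len=128, token IDs usually have to be fitted to that window before inference. The sketch below shows one plausible way to do it, written in Python to mirror the reference side. Several things here are assumptions rather than facts from the card: that the package expects exactly 128 IDs, that it returns logits for every position, that right-padding is used, and the padding ID itself. `fit_to_window` and `PAD_ID` are hypothetical names.

```python
from typing import List, Tuple

SEQ_LEN = 128   # input length of the exported package (from the file name)
PAD_ID = 0      # hypothetical padding id; check tokenizer_en.json / config.json


def fit_to_window(token_ids: List[int], seq_len: int = SEQ_LEN,
                  pad_id: int = PAD_ID) -> Tuple[List[int], int]:
    """Return (ids of length seq_len, index of the last real token)."""
    if not token_ids:
        raise ValueError("need at least one token to predict from")
    # Keep the most recent tokens: they carry the context closest to the cursor.
    kept = token_ids[-seq_len:]
    last_index = len(kept) - 1
    # Right-pad to the fixed length. In a causal model, positions after
    # last_index cannot influence the output read at last_index.
    padded = kept + [pad_id] * (seq_len - len(kept))
    return padded, last_index


padded, last = fit_to_window([5, 17, 42])
print(len(padded), last)
```

The next-token prediction would then be read at position `last`, not at the end of the padded sequence.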

iOS usage notes

  • Strip trailing whitespace from prompts before tokenization. The model was trained on clean sentences and produces noisy subword fragments on inputs such as "Hey guys " (with a trailing space). Call prompt.trimmingCharacters(in: .whitespacesAndNewlines) before encoding; note that this also removes leading whitespace.
  • 100% argmax agreement between PyTorch and the exported CoreML model on validation prompts. The top-1 prediction on iOS always matches the PyTorch reference; the raw outputs are not bit-identical but differ only by fp16 quantization noise (max abs diff ≈ 0.011).
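
A parity check of the kind described above can be sketched as follows. This is not the script used for run8. It only illustrates the two reported metrics, argmax agreement and maximum absolute difference, on per-prompt output vectors that have already been collected from both runtimes. `compare_outputs` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def compare_outputs(reference: List[Sequence[float]],
                    candidate: List[Sequence[float]]) -> Tuple[float, float]:
    """Return (fraction of prompts with matching argmax, max abs difference)."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("need two non-empty lists of equal length")
    matches = 0
    max_abs_diff = 0.0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("output vectors must have the same size")
        matches += argmax(ref_row) == argmax(cand_row)
        row_diff = max(abs(r - c) for r, c in zip(ref_row, cand_row))
        max_abs_diff = max(max_abs_diff, row_diff)
    return matches / len(reference), max_abs_diff


# Toy data standing in for PyTorch (fp32) and CoreML (fp16) outputs.
pytorch_out = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
coreml_out = [[0.11, 1.99, -1.0], [2.995, 0.5, 0.21]]
agreement, diff = compare_outputs(pytorch_out, coreml_out)
print(f"argmax agreement: {agreement:.0%}, max abs diff: {diff:.3f}")
```

Reporting both numbers is useful because small numeric differences are expected after fp16 conversion, while a drop in argmax agreement would change what the keyboard actually suggests.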

License

Apache 2.0
