Commit 456252b by niduank · 0 Parent(s)

Upload Onit Keyboard LM (run7, tok_v2, 40M params)

Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +136 -0
  3. checkpoint_full.pt +3 -0
  4. config.json +18 -0
  5. model.pt +3 -0
  6. tokenizer.json +0 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+ *.pt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,136 @@
+ ---
+ language:
+ - en
+ - fr
+ license: apache-2.0
+ tags:
+ - keyboard
+ - language-model
+ - mobile
+ - ios
+ - coreml
+ - bilingual
+ library_name: pytorch
+ pipeline_tag: text-generation
+ ---
+
+ # Onit Keyboard LM
+
+ A **41M parameter** bilingual (English + French) language model designed for **mobile keyboard prediction** on iOS.
+
+ ## Model Description
+
+ Onit Keyboard LM is a compact causal language model optimized for next-word prediction in a mobile keyboard context. It supports both English and French, including code-switching between the two languages.
+
+ ### Architecture
+
+ | Component | Value |
+ |-----------|-------|
+ | Type | Causal LM (decoder-only) |
+ | Parameters | ~41M |
+ | Vocabulary | 16,384 BPE tokens |
+ | Embedding dim | 512 |
+ | Layers | 10 |
+ | Attention heads | 8 |
+ | FFN dim | 1408 (SwiGLU) |
+ | Max sequence length | 256 |
+ | Positional encoding | RoPE |
+ | Normalization | RMSNorm + QK-Norm |
+ | Embeddings | Tied (input = output) |
+
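The parameter figure can be roughly cross-checked from the architecture table. The following is a minimal sketch, not the authors' accounting: it assumes bias-free linear layers, four `dim x dim` attention projections (Q, K, V, O), and a three-matrix SwiGLU FFN, and it ignores the small norm weights. The function name `estimate_params` is hypothetical.

```python
def estimate_params(vocab_size=16384, dim=512, num_layers=10, ffn_dim=1408):
    """Rough parameter count for a decoder-only transformer with tied embeddings.

    Assumptions (not confirmed by the source): no biases, four dim x dim
    attention projections (Q, K, V, O), and a SwiGLU FFN with three matrices.
    Norm weights are omitted because they add only a few thousand parameters.
    """
    embedding = vocab_size * dim   # counted once: input and output are tied
    attention = 4 * dim * dim      # Q, K, V, O projections
    ffn = 3 * dim * ffn_dim        # gate, up, and down projections
    return embedding + num_layers * (attention + ffn)


if __name__ == "__main__":
    total = estimate_params()
    print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands at about 40.5M, which is consistent with both the "~41M" in the table and the "40M" in the commit title as roundings of the same number.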
+ ### Key Design Choices
+
+ - **SwiGLU FFN** for better parameter efficiency at small scale
+ - **QK-Norm** for stable training without careful LR tuning
+ - **RoPE** for length generalization
+ - **Tied embeddings** to reduce parameter count (critical for mobile)
+ - **BPE tokenizer** (16K vocab) trained on the bilingual data mix
+
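To make the RoPE bullet concrete, here is a small pure-Python sketch of rotary position embedding on a single head vector. It is illustrative only and is not taken from the model's code: the adjacent-pair layout is an assumption (some implementations pair element `i` with `i + d/2`), and `rope_rotate` and `dot` are hypothetical helper names. The base of 10000 matches `rope_base` in `config.json`.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to one head vector (a list of floats).

    Each adjacent pair (vec[i], vec[i + 1]) for even i is rotated by the
    angle position * base ** (-i / d). The pairing layout is an assumption.
    """
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0, 0.25]
    k = [1.0, 0.5, -0.5, 1.5]
    # Same relative offset (2) at two different absolute positions.
    near = dot(rope_rotate(q, 3), rope_rotate(k, 1))
    far = dot(rope_rotate(q, 103), rope_rotate(k, 101))
    print(abs(near - far) < 1e-9)
```

The point of the check is that the query-key score depends only on the distance between the two positions, not on where they sit in the sequence.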
+ ## Training
+
+ ### Dataset (Phase 2)
+
+ The model was trained on a diverse bilingual mix:
+
+ | Source | Language | Share |
+ |--------|----------|-------|
+ | OpenSubtitles | FR + EN | ~40% |
+ | Wikipedia | FR + EN | ~30% |
+ | C4 (web) | FR + EN | ~30% |
+
+ Total: ~13.6M sentences, ~2.7 GB of clean text.
+
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Training steps | 30,000 |
+ | Effective batch size | 64 (32 x 2 grad accum) |
+ | Learning rate | 6e-5 (cosine decay) |
+ | Warmup steps | 1,000 |
+ | Precision | bf16 mixed |
+ | Optimizer | AdamW |
+
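The learning-rate row can be read as a warmup-then-cosine schedule. The sketch below is one plausible reading, not the training script: the peak rate, warmup length, and total steps come from the table, while the linear warmup shape and a floor of `min_lr=0.0` are assumptions. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, peak_lr=6e-5, warmup_steps=1000, total_steps=30000, min_lr=0.0):
    """Linear warmup followed by cosine decay.

    peak_lr, warmup_steps, and total_steps come from the hyperparameter
    table; the linear warmup shape and min_lr=0.0 are assumptions.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, 1000, 15500, 30000):
        print(step, f"{learning_rate(step):.2e}")
```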
+ ### Results
+
+ | Metric | Value |
+ |--------|-------|
+ | Training loss (final) | 2.01 |
+ | Validation PPL | 58.8 |
+ | Tokens seen | 491M |
+
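Two of these figures can be related to the hyperparameters with simple arithmetic. This is a sketch under stated assumptions: `tokens_seen` assumes every training sequence is filled to the 256-token maximum, and `loss_from_perplexity` assumes perplexity is the exponential of the mean per-token cross-entropy in nats. Both function names are hypothetical.

```python
import math


def tokens_seen(steps=30_000, effective_batch=64, seq_len=256):
    """Tokens processed, assuming every sequence is packed to full length."""
    return steps * effective_batch * seq_len


def loss_from_perplexity(ppl):
    """Mean cross-entropy in nats per token, assuming ppl = exp(loss)."""
    return math.log(ppl)


if __name__ == "__main__":
    print(f"{tokens_seen():,} tokens")
    print(f"PPL 58.8 -> loss {loss_from_perplexity(58.8):.2f} nats/token")
```

With the table's values, 30,000 steps x 64 sequences x 256 tokens gives 491,520,000, which matches the reported 491M. The same conversion puts a perplexity of 58.8 at roughly 4.07 nats per token.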
+ ## Usage
+
+ ### PyTorch
+
+ ```python
+ import torch
+ from tokenizers import Tokenizer
+ from keyboard_lm.model import JointUniLM, ModelConfig
+
+ # Load the weights and the config stored alongside them.
+ # weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
+ ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
+ config = ModelConfig(**ckpt["model_config"])
+ model = JointUniLM(config)
+ model.load_state_dict(ckpt["model_state_dict"])
+ model.eval()  # disable dropout for inference
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ # Predict next token
+ prompt = "I'm going to the"
+ ids = [config.bos_token_id] + tokenizer.encode(prompt).ids
+ input_ids = torch.tensor([ids])  # shape: (1, sequence_length)
+
+ with torch.no_grad():
+     logits, _ = model(input_ids)
+     probs = torch.softmax(logits[0, -1], dim=-1)  # distribution at the last position
+     top5 = torch.topk(probs, 5)
+
+ for prob, idx in zip(top5.values, top5.indices):
+     print(f" {tokenizer.decode([idx.item()]):>10} ({prob.item():.1%})")
+ ```
+
+ ### CoreML (iOS)
+
+ See `scripts/export_coreml.py` in the [GitHub repo](https://github.com/synth-inc/onit-keyboard-lm) for CoreML conversion.
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `model.pt` | Model weights + config (no optimizer) |
+ | `checkpoint_full.pt` | Full training checkpoint (with optimizer, for resume) |
+ | `config.json` | Model configuration |
+ | `tokenizer.json` | BPE tokenizer (v2, trained on Phase 2 mix) |
+
+ ## Limitations
+
+ - Optimized for short text (keyboard input), not long-form generation
+ - May produce grammatical errors in French (e.g., double negatives)
+ - 256-token context window limits long-range coherence
+ - Not suitable for factual Q&A or instruction following
+
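Because of the 256-token window, a caller has to decide what to do when the typed text is longer than the model can see. The sketch below shows one possible policy, which is an assumption and not something the source specifies: keep the BOS token and drop the oldest tokens. The defaults match `bos_token_id` and `max_seq_len` in `config.json`; `build_context` is a hypothetical helper.

```python
def build_context(token_ids, bos_token_id=2, max_seq_len=256):
    """Prepend BOS and keep only the most recent tokens that fit the window.

    bos_token_id and max_seq_len match config.json; dropping the oldest
    tokens while keeping BOS is an assumed policy.
    """
    budget = max_seq_len - 1  # reserve one slot for BOS
    recent = list(token_ids[-budget:]) if len(token_ids) > budget else list(token_ids)
    return [bos_token_id] + recent


if __name__ == "__main__":
    long_input = list(range(10, 310))  # 300 placeholder token ids
    context = build_context(long_input)
    print(len(context), context[0], context[-1])
```

For keyboard prediction the most recent words usually matter most, which is why this sketch trims from the front rather than the back.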
+ ## License
+
+ Apache 2.0
checkpoint_full.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1815019697c1ba8ac9c770097bd0234d9ead3a92c8aa74f40c567aef220eab7c
+ size 486296722
config.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "vocab_size": 16384,
+   "dim": 512,
+   "num_layers": 10,
+   "num_heads": 8,
+   "ffn_dim": 1408,
+   "max_seq_len": 256,
+   "dropout": 0.1,
+   "tied_embeddings": true,
+   "qk_norm": true,
+   "rope_base": 10000.0,
+   "rms_norm_eps": 1e-06,
+   "pad_token_id": 0,
+   "unk_token_id": 1,
+   "bos_token_id": 2,
+   "eos_token_id": 3,
+   "mask_token_id": 4
+ }
model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d8648eaa471f0baf1229546dfd29bb4b5532b9b37a1338c82bb50ce6be074049
+ size 162087274
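The `.pt` entries in this commit are Git LFS pointer files: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes of the real file. A minimal sketch of reading one follows; `parse_lfs_pointer` is a hypothetical helper, and the float32 interpretation of the size is an assumption, since the file also holds the config and serialization overhead.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size of the real file in bytes
    return fields


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d8648eaa471f0baf1229546dfd29bb4b5532b9b37a1338c82bb50ce6be074049
size 162087274
"""

if __name__ == "__main__":
    info = parse_lfs_pointer(POINTER)
    # Assumption: the weights are stored as 4-byte float32 values.
    print(f"~{info['size'] / 4 / 1e6:.1f}M float32 values")
```

Read this way, the 162 MB `model.pt` corresponds to about 40.5M four-byte values, in line with the parameter count stated in the README.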
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff