---
language:
- af
- am
- ar
- hy
- az
- bn
- ba
- be
- bs
- bg
- ca
- zh
- hr
- cs
- da
- nl
- en
- eo
- et
- eu
- fil
- fi
- fr
- fur
- gl
- ka
- de
- el
- gu
- ha
- haw
- he
- hi
- hu
- is
- id
- ga
- it
- ja
- kn
- kk
- km
- ko
- ky
- lo
- la
- lv
- lt
- mk
- mg
- ms
- ml
- mt
- mr
- mn
- my
- ne
- nn
- oc
- om
- ps
- fa
- pl
- pt
- ro
- ru
- sa
- sat
- sr
- sn
- si
- sk
- sl
- es
- sw
- sv
- gsw
- ta
- tt
- te
- th
- bo
- tr
- udm
- uk
- ur
- uz
- vi
- cy
- xh
- yi
- yo
- zu
tags:
- keyboardrage
- semantic-search
- embeddings
- typing-game
- multilingual
- faiss
- granite-embedding
license: mit
datasets:
- wiktionary
- monkeytype
---

# KeyboardRage Semantic Models

Precomputed multilingual semantic embeddings and neighbor indices for the [KeyboardRage](https://github.com/EMRD95/keyboardrage) typing game's Galaxy visualization.

## Overview

This repository contains the trained semantic models that power the 3D semantic word galaxy in KeyboardRage. In total, 2,250,636 words across 108 languages are embedded into a 384-dimensional space using IBM's Granite multilingual embedding model and then projected to 3D with UMAP. Precomputed nearest-neighbor indices enable real-time similarity queries.

## Contents

### Global embeddings & index (15 GB)
| File | Size | Description |
|------|------|-------------|
| `semantic_embeddings.f32.npy` | 3.3 GB | Float32 embeddings for all 2.25M words (384-dim, L2-normalized so that inner product equals cosine similarity) |
| `semantic_faiss_hnsw.index` | 3.3 GB | FAISS flat inner-product index (exact cosine similarity over normalized vectors) |
| `neighbor_ids.npy` | 1.7 GB | Precomputed global top-200 neighbor IDs (rows × 200, int64) |
| `neighbor_scores.npy` | 1.7 GB | Precomputed global top-200 cosine similarity scores (float32) |
| `semantic_index_meta.json` | ~1 KB | Model metadata (embedding model, dimensions, row count) |

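As a minimal sketch of how the precomputed neighbor arrays could be read, the snippet below assumes that row `i` of each array belongs to the word with global ID `i` and that columns run from most to least similar. Neither assumption is confirmed by this card. `top_neighbors` is a hypothetical helper, and the small arrays are synthetic stand-ins for the real files.

```python
import numpy as np


def top_neighbors(neighbor_ids, neighbor_scores, word_id, k=10):
    """Return the first k (neighbor_id, score) pairs stored for word_id.

    Assumes row index == global word ID and columns sorted by
    descending similarity (an assumption, not documented behavior).
    """
    ids = neighbor_ids[word_id, :k]
    scores = neighbor_scores[word_id, :k]
    return [(int(i), float(s)) for i, s in zip(ids, scores)]


# Synthetic stand-in: 3 words with 2 neighbors each.
demo_ids = np.array([[1, 2], [0, 2], [1, 0]], dtype=np.int64)
demo_scores = np.array([[0.9, 0.4], [0.9, 0.7], [0.7, 0.4]], dtype=np.float32)
print(top_neighbors(demo_ids, demo_scores, word_id=0, k=2))

# With the real multi-gigabyte files, memory-mapping avoids reading
# everything into RAM:
# neighbor_ids = np.load("neighbor_ids.npy", mmap_mode="r")
# neighbor_scores = np.load("neighbor_scores.npy", mmap_mode="r")
```

Memory-mapping keeps startup cheap because only the rows actually indexed are paged in from disk.
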
### Per-language neighbor indices
108 languages, each with 4 files:
- `neighbor_ids_{lang}.npy` — precomputed within-language top-200 neighbor IDs
- `neighbor_scores_{lang}.npy` — cosine similarity scores
- `lang_index_{lang}.npy` — global-ID → local-ID mapping
- `neighbor_meta_{lang}.json` — per-language statistics

### 3D projection metadata
| File | Size | Description |
|------|------|-------------|
| `atlas_data.parquet` | 88 MB | Word metadata: 3D UMAP coordinates (x, y, z), word, language, definition |

### Raw word embeddings (834 MB)
`words_emb_merged/{lang}.json` — raw embedding vectors per language, used for regeneration workflows.

## Model Details

- **Embedding model**: [ibm-granite/granite-embedding-97m-multilingual-r2](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2)
- **Embedding dimension**: 384
- **Total words embedded**: 2,250,636
- **Languages**: 108
- **Similarity metric**: Cosine similarity via normalized inner product
- **3D projection**: UMAP (n_components=3, metric='cosine')
- **Neighbor count**: Top 200 per word (global + per-language)

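The similarity metric listed above relies on a simple identity: once two vectors are scaled to unit length, their inner product equals their cosine similarity. The following dependency-free sketch checks this on two toy vectors. The helper names are illustrative and not taken from the KeyboardRage code.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit Euclidean length (v must be non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# The inner product of the normalized vectors matches the cosine.
ip_of_normalized = dot(l2_normalize(a), l2_normalize(b))
assert math.isclose(ip_of_normalized, cosine(a, b))
```

Normalizing once at build time means every later query needs only a dot product, which is why an inner-product index can serve cosine queries.
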
## 108 Supported Languages

afrikaans, albanian, amharic, arabic (+egypt, +morocco), armenian (+western), azerbaijani, bangla, bashkir, belarusian (+lacinka), bosnian, bulgarian, catalan, chinese_simplified, chinese_traditional, croatian, czech, danish, dutch, english, esperanto (+h_sistemo, +x_sistemo), estonian, euskera, filipino, finnish, french, friulian, galician, georgian, german, greek, gujarati, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, irish, italian, japanese (hiragana, katakana, romaji), kannada, kazakh, khmer, korean, kyrgyz, lao, latin, latvian, lithuanian, macedonian, malagasy, malay, malayalam, maltese, marathi, mongolian, myanmar, nepali, norwegian_nynorsk, occitan, oromo, pashto, persian, polish, portuguese (+acentos_e_cedilha), romanian, russian, sanskrit, santali, serbian (+latin), shona, sinhala, slovak, slovenian, spanish, swahili, swedish, swiss_german, tamil, tatar (+crimean, +crimean_cyrillic), telugu, thai, tibetan, turkish, udmurt, ukrainian (+latynka), urdu, uzbek, vietnamese, welsh, xhosa, yiddish, yoruba, zulu

## On-Premise Deployment

### Prerequisites
- Python 3.10+
- FastAPI, Uvicorn, NumPy, DuckDB, PyArrow

### Quick start

```bash
# 1. Clone the game code
git clone https://github.com/EMRD95/keyboardrage
cd keyboardrage

# 2. Download models from Hugging Face
./setup.sh

# 3. Run the semantic neighbors API
cd galaxy/semantic
pip install fastapi uvicorn numpy duckdb pyarrow
python semantic_neighbors_server.py
# API available at http://localhost:8703
```

### API Endpoints

| Endpoint | Description |
|----------|-------------|
| `GET /health` | Server status, available languages, row count |
| `GET /point/{id}` | Get word metadata by global ID |
| `GET /neighbors/{id}?k=10&language=french` | Get nearest neighbors (global or per-language) |
| `GET /search?q=mot&language=french` | Full-text word search |

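A small client for the endpoints above might look like the following sketch, which uses only the standard library. It assumes the server returns JSON (the response schema is not documented here, so the parsed payload is returned as-is) and that it listens on the quick-start address. `neighbors_url`, `search_url`, and `get_json` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8703"  # address from the quick start


def neighbors_url(word_id, k=10, language=None, base_url=BASE_URL):
    """Build a /neighbors URL; omit language for a global query."""
    params = {"k": k}
    if language is not None:
        params["language"] = language
    return f"{base_url}/neighbors/{word_id}?{urlencode(params)}"


def search_url(query, language=None, base_url=BASE_URL):
    """Build a /search URL; urlencode escapes non-ASCII query text."""
    params = {"q": query}
    if language is not None:
        params["language"] = language
    return f"{base_url}/search?{urlencode(params)}"


def get_json(url, timeout=10):
    """Fetch a URL and parse the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


print(neighbors_url(42, k=10, language="french"))
print(search_url("mot", language="french"))
# With the server running: data = get_json(neighbors_url(42, language="french"))
```

Building URLs with `urlencode` rather than string concatenation keeps search terms with accents or spaces correctly escaped.
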
## Source Code

The KeyboardRage game source code and visualization themes are at:
**[github.com/EMRD95/keyboardrage](https://github.com/EMRD95/keyboardrage)**

## Regeneration

To rebuild these models from scratch:

```bash
# 1. Rebuild embeddings from merged word lists
cd galaxy && ./rebuild_from_merged_words.sh

# 2. Rebuild semantic index
cd semantic && python build_semantic_index.py

# 3. Precompute neighbors
python precompute_neighbors.py --per-language

# 4. Rebuild 3D projection
cd ../3D_galaxy && ./run_umap50_projection_rebuild.sh
```

All rebuild scripts are in the [GitHub repository](https://github.com/EMRD95/keyboardrage/tree/develop/galaxy).

## License

MIT — same as KeyboardRage.