---
language:
- af
- am
- ar
- hy
- az
- bn
- ba
- be
- bs
- bg
- ca
- zh
- hr
- cs
- da
- nl
- en
- eo
- et
- eu
- fil
- fi
- fr
- fur
- gl
- ka
- de
- el
- gu
- ha
- haw
- he
- hi
- hu
- is
- id
- ga
- it
- ja
- kn
- kk
- km
- ko
- ky
- lo
- la
- lv
- lt
- mk
- mg
- ms
- ml
- mt
- mr
- mn
- my
- ne
- nn
- oc
- om
- ps
- fa
- pl
- pt
- ro
- ru
- sa
- sat
- sr
- sn
- si
- sk
- sl
- es
- sw
- sv
- gsw
- ta
- tt
- te
- th
- bo
- tr
- udm
- uk
- ur
- uz
- vi
- cy
- xh
- yi
- yo
- zu
tags:
- keyboardrage
- semantic-search
- embeddings
- typing-game
- multilingual
- faiss
- granite-embedding
license: mit
datasets:
- wiktionary
- monkeytype
---
# KeyboardRage Semantic Models
Precomputed multilingual semantic embeddings and neighbor indices for the [KeyboardRage](https://github.com/EMRD95/keyboardrage) typing game's Galaxy visualization.
## Overview
This repository contains the precomputed semantic data that powers the 3D semantic word galaxy in KeyboardRage. A total of 2,250,636 words across 108 languages are embedded into a 384-dimensional space using IBM's Granite multilingual embedding model, then projected to 3D via UMAP. Precomputed nearest-neighbor indices enable real-time similarity queries.
## Contents
### Global embeddings & index (15 GB)
| File | Size | Description |
|------|------|-------------|
| `semantic_embeddings.f32.npy` | 3.3 GB | Float32 embeddings for all 2.25M words (384-dim, L2-normalized so that inner product equals cosine similarity) |
| `semantic_faiss_hnsw.index` | 3.3 GB | FAISS flat inner-product index (exact cosine similarity over normalized vectors) |
| `neighbor_ids.npy` | 1.7 GB | Precomputed global top-200 neighbor IDs (rows × 200, int64) |
| `neighbor_scores.npy` | 1.7 GB | Precomputed global top-200 cosine similarity scores (float32) |
| `semantic_index_meta.json` | ~1 KB | Model metadata (embedding model, dimensions, row count) |
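
The precomputed neighbor arrays are large, so reading them through a memory map avoids loading whole files into RAM. The following is a minimal sketch, not the server's actual code: the helper name `top_neighbors` is hypothetical, and it assumes each row is sorted by descending similarity, which this card does not state. The demo writes tiny synthetic arrays in place of the real files.

```python
import tempfile
from pathlib import Path

import numpy as np


def top_neighbors(model_dir, word_id, k=10):
    """Return (ids, scores) for the first k precomputed neighbors of word_id.

    Hypothetical helper. Assumes rows are ordered best-first (not confirmed).
    """
    model_dir = Path(model_dir)
    # mmap_mode="r" maps the file read-only instead of reading it all into memory.
    ids = np.load(model_dir / "neighbor_ids.npy", mmap_mode="r")
    scores = np.load(model_dir / "neighbor_scores.npy", mmap_mode="r")
    # np.array copies the small slice out of the memory map.
    return np.array(ids[word_id, :k]), np.array(scores[word_id, :k])


# Demo with synthetic stand-ins for the real files (3 words, 2 neighbors each).
with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "neighbor_ids.npy",
            np.array([[1, 2], [0, 2], [1, 0]], dtype=np.int64))
    np.save(Path(tmp) / "neighbor_scores.npy",
            np.array([[0.9, 0.1], [0.9, 0.3], [0.3, 0.1]], dtype=np.float32))
    demo_ids, demo_scores = top_neighbors(tmp, word_id=1, k=1)
```

With the real files, `model_dir` would point at the directory holding the downloaded arrays.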
### Per-language neighbor indices
108 languages, each with 4 files:
- `neighbor_ids_{lang}.npy` — precomputed within-language top-200 neighbor IDs
- `neighbor_scores_{lang}.npy` — cosine similarity scores
- `lang_index_{lang}.npy` — global-ID → local-ID mapping
- `neighbor_meta_{lang}.json` — per-language statistics
### 3D projection metadata
| File | Size | Description |
|------|------|-------------|
| `atlas_data.parquet` | 88 MB | Word metadata: 3D UMAP coordinates (x, y, z), word, language, definition |
### Raw word embeddings (834 MB)
`words_emb_merged/{lang}.json` — raw embedding vectors per language, used for regeneration workflows.
## Model Details
- **Embedding model**: [ibm-granite/granite-embedding-97m-multilingual-r2](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2)
- **Embedding dimension**: 384
- **Total words embedded**: 2,250,636
- **Languages**: 108
- **Similarity metric**: Cosine similarity via normalized inner product
- **3D projection**: UMAP (n_components=3, metric='cosine')
- **Neighbor count**: Top 200 per word (global + per-language)
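
To illustrate the similarity metric above: once vectors are scaled to unit length, their inner product is their cosine similarity, so a top-k search reduces to a matrix product and a sort. The sketch below is a brute-force version on toy data. It is not the repository's build script, the function names are hypothetical, and excluding each word from its own neighbor list is an assumption. At 2.25M rows the full similarity matrix would not fit in memory, which is why a search index such as FAISS is used instead.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k_neighbors(embeddings, k):
    """Brute-force top-k cosine neighbors for every row (toy sizes only)."""
    unit = l2_normalize(np.asarray(embeddings, dtype=np.float32))
    sims = unit @ unit.T                 # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)      # assumption: a word is not its own neighbor
    ids = np.argsort(-sims, axis=1)[:, :k]
    scores = np.take_along_axis(sims, ids, axis=1)
    return ids, scores


# Toy example: four 2-D vectors instead of 2.25M 384-D vectors.
toy = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 3.0], [-1.0, 0.0]])
toy_ids, toy_scores = top_k_neighbors(toy, k=2)
```

Here the nearest neighbor of the first vector is the second one, since they point in almost the same direction despite having different lengths.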
## 108 Supported Languages
afrikaans, albanian, amharic, arabic (+egypt, +morocco), armenian (+western), azerbaijani, bangla, bashkir, belarusian (+lacinka), bosnian, bulgarian, catalan, chinese_simplified, chinese_traditional, croatian, czech, danish, dutch, english, esperanto (+h_sistemo, +x_sistemo), estonian, euskera, filipino, finnish, french, friulian, galician, georgian, german, greek, gujarati, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, irish, italian, japanese (hiragana, katakana, romaji), kannada, kazakh, khmer, korean, kyrgyz, lao, latin, latvian, lithuanian, macedonian, malagasy, malay, malayalam, maltese, marathi, mongolian, myanmar, nepali, norwegian_nynorsk, occitan, oromo, pashto, persian, polish, portuguese (+acentos_e_cedilha), romanian, russian, sanskrit, santali, serbian (+latin), shona, sinhala, slovak, slovenian, spanish, swahili, swedish, swiss_german, tamil, tatar (+crimean, +crimean_cyrillic), telugu, thai, tibetan, turkish, udmurt, ukrainian (+latynka), urdu, uzbek, vietnamese, welsh, xhosa, yiddish, yoruba, zulu
## On-Premise Deployment
### Prerequisites
- Python 3.10+
- FastAPI, Uvicorn, NumPy, DuckDB, PyArrow
### Quick start
```bash
# 1. Clone the game code
git clone https://github.com/EMRD95/keyboardrage
cd keyboardrage
# 2. Download models from HuggingFace
./setup.sh
# 3. Run the semantic neighbors API
cd galaxy/semantic
pip install fastapi uvicorn numpy duckdb pyarrow
python semantic_neighbors_server.py
# API available at http://localhost:8703
```
### API Endpoints
| Endpoint | Description |
|----------|-------------|
| `GET /health` | Server status, available languages, row count |
| `GET /point/{id}` | Get word metadata by global ID |
| `GET /neighbors/{id}?k=10&language=french` | Get nearest neighbors (global or per-language) |
| `GET /search?q=mot&language=french` | Full-text word search |
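
A minimal client sketch for the endpoints above, using only the Python standard library. The paths and query parameters come from the table; the helper names are hypothetical, the response schema is not documented here and so is returned as parsed JSON without interpretation, and treating an omitted `language` as a global query is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8703"  # default address from the quick start


def neighbors_url(word_id, k=10, language=None, base_url=BASE_URL):
    """Build the /neighbors URL; omitting language is assumed to mean global."""
    params = {"k": k}
    if language is not None:
        params["language"] = language
    return f"{base_url}/neighbors/{int(word_id)}?{urlencode(params)}"


def search_url(query, language=None, base_url=BASE_URL):
    """Build the /search URL; urlencode escapes non-ASCII query text."""
    params = {"q": query}
    if language is not None:
        params["language"] = language
    return f"{base_url}/search?{urlencode(params)}"


def get_json(url, timeout=5.0):
    """Fetch a URL and parse the body as JSON (requires a running server)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only works while semantic_neighbors_server.py is running locally.
    print(get_json(f"{BASE_URL}/health"))
    print(get_json(neighbors_url(0, k=10, language="french")))
```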
## Source Code
The KeyboardRage game source code and visualization themes are at:
**[github.com/EMRD95/keyboardrage](https://github.com/EMRD95/keyboardrage)**
## Regeneration
To rebuild these models from scratch:
```bash
# 1. Rebuild embeddings from merged word lists
cd galaxy && ./rebuild_from_merged_words.sh
# 2. Rebuild semantic index
cd semantic && python build_semantic_index.py
# 3. Precompute neighbors
python precompute_neighbors.py --per-language
# 4. Rebuild 3D projection
cd ../3D_galaxy && ./run_umap50_projection_rebuild.sh
```
All rebuild scripts are in the [GitHub repository](https://github.com/EMRD95/keyboardrage/tree/develop/galaxy).
## License
MIT — same as KeyboardRage.