---
language:
- af
- am
- ar
- hy
- az
- bn
- ba
- be
- bs
- bg
- ca
- zh
- hr
- cs
- da
- nl
- en
- eo
- et
- eu
- fil
- fi
- fr
- fur
- gl
- ka
- de
- el
- gu
- ha
- haw
- he
- hi
- hu
- is
- id
- ga
- it
- ja
- kn
- kk
- km
- ko
- ky
- lo
- la
- lv
- lt
- mk
- mg
- ms
- ml
- mt
- mr
- mn
- my
- ne
- nn
- oc
- om
- ps
- fa
- pl
- pt
- ro
- ru
- sa
- sat
- sr
- sn
- si
- sk
- sl
- es
- sw
- sv
- gsw
- ta
- tt
- te
- th
- bo
- tr
- udm
- uk
- ur
- uz
- vi
- cy
- xh
- yi
- yo
- zu
tags:
- keyboardrage
- semantic-search
- embeddings
- typing-game
- multilingual
- faiss
- granite-embedding
license: mit
datasets:
- wiktionary
- monkeytype
---

# KeyboardRage Semantic Models

Precomputed multilingual semantic embeddings and neighbor indices for the [KeyboardRage](https://github.com/EMRD95/keyboardrage) typing game's Galaxy visualization.

## Overview

This repository contains the trained semantic models that power the 3D semantic word galaxy in KeyboardRage. 2,250,636 words across 108 languages are embedded into a 384-dimensional space using IBM's Granite multilingual embedding model, then projected to 3D via UMAP. Precomputed nearest-neighbor indices enable real-time similarity queries.
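To make the idea of a precomputed neighbor index concrete, here is a minimal sketch of brute-force top-k search over unit-length vectors, where the inner product equals cosine similarity. It uses toy 3-dimensional vectors instead of the real 384-dimensional embeddings. The helper names (`l2_normalize`, `top_k_neighbors`) and the choice to exclude a word from its own neighbor list are assumptions made for illustration; this is not the repository's build script.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_neighbors(vectors, query_id, k):
    """Return [(neighbor_id, score), ...] sorted by descending inner product."""
    query = vectors[query_id]
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_id  # assumption: a word is not listed as its own neighbor
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 3-dimensional "embeddings"; the real rows have 384 dimensions.
toy_vectors = [l2_normalize(v) for v in [
    [1.0, 0.0, 0.0],   # id 0
    [0.9, 0.1, 0.0],   # id 1, close to id 0
    [0.0, 1.0, 0.0],   # id 2, orthogonal to id 0
    [-1.0, 0.0, 0.0],  # id 3, opposite of id 0
]]

neighbors = top_k_neighbors(toy_vectors, query_id=0, k=2)
print(neighbors)
```

Running a search like this for every word once, and storing the resulting IDs and scores, is what lets a server answer similarity queries by array lookup instead of recomputing distances.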
## Contents

### Global embeddings & index (15 GB)

| File | Size | Description |
|------|------|-------------|
| `semantic_embeddings.f32.npy` | 3.3 GB | Float32 embeddings for all 2.25M words (384-dim, L2-normalized so that inner product equals cosine similarity) |
| `semantic_faiss_hnsw.index` | 3.3 GB | FAISS flat inner-product index (exact cosine similarity over normalized vectors) |
| `neighbor_ids.npy` | 1.7 GB | Precomputed global top-200 neighbor IDs (rows × 200, int64) |
| `neighbor_scores.npy` | 1.7 GB | Precomputed global top-200 cosine similarity scores (float32) |
| `semantic_index_meta.json` | ~1 KB | Model metadata (embedding model, dimensions, row count) |

### Per-language neighbor indices

108 languages, each with 4 files:

- `neighbor_ids_{lang}.npy`: precomputed within-language top-200 neighbor IDs
- `neighbor_scores_{lang}.npy`: cosine similarity scores
- `lang_index_{lang}.npy`: global-ID → local-ID mapping
- `neighbor_meta_{lang}.json`: per-language statistics

### 3D projection metadata

| File | Size | Description |
|------|------|-------------|
| `atlas_data.parquet` | 88 MB | Word metadata: 3D UMAP coordinates (x, y, z), word, language, definition |

### Raw word embeddings (834 MB)

`words_emb_merged/{lang}.json`: raw embedding vectors per language, used for regeneration workflows.
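The global neighbor tables can be read without loading them fully into memory. The sketch below memory-maps `neighbor_ids.npy` and `neighbor_scores.npy` with NumPy and returns the top-k entries for one word ID. It assumes only what the table above states (two arrays of the same shape, one row per word); the function names are hypothetical, and because the column order within a row is not documented here, the code sorts by score explicitly. Tiny stand-in files are written to a temporary directory so the example runs without the real multi-gigabyte downloads.

```python
import os
import tempfile

import numpy as np


def load_neighbor_tables(directory):
    """Memory-map the ID and score tables; rows are read from disk on demand."""
    ids = np.load(os.path.join(directory, "neighbor_ids.npy"), mmap_mode="r")
    scores = np.load(os.path.join(directory, "neighbor_scores.npy"), mmap_mode="r")
    if ids.shape != scores.shape:
        raise ValueError("neighbor ID and score tables must have the same shape")
    return ids, scores


def top_k(ids, scores, word_id, k=10):
    """Return [(neighbor_id, score), ...] for one word, best score first."""
    row_ids = np.asarray(ids[word_id])
    row_scores = np.asarray(scores[word_id])
    # Sort explicitly rather than assuming the stored column order.
    order = np.argsort(-row_scores, kind="stable")[:k]
    return [(int(row_ids[i]), float(row_scores[i])) for i in order]


# Stand-in data: 3 words with 2 neighbors each (the real tables have 200 columns).
with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "neighbor_ids.npy"),
            np.array([[1, 2], [0, 2], [1, 0]], dtype=np.int64))
    np.save(os.path.join(tmp, "neighbor_scores.npy"),
            np.array([[0.9, 0.1], [0.9, 0.4], [0.4, 0.1]], dtype=np.float32))
    ids, scores = load_neighbor_tables(tmp)
    result = top_k(ids, scores, word_id=1, k=1)
    del ids, scores  # release the memory maps before the directory is removed

print(result)
```

To use the real files, point `load_neighbor_tables` at the directory that holds the downloaded models. The per-language files would need an extra step through `lang_index_{lang}.npy`, whose exact layout is not described here.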
## Model Details

- **Embedding model**: [ibm-granite/granite-embedding-97m-multilingual-r2](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2)
- **Embedding dimension**: 384
- **Total words embedded**: 2,250,636
- **Languages**: 108
- **Similarity metric**: Cosine similarity via normalized inner product
- **3D projection**: UMAP (n_components=3, metric='cosine')
- **Neighbor count**: Top 200 per word (global + per-language)

## 108 Supported Languages

afrikaans, albanian, amharic, arabic (+egypt, +morocco), armenian (+western), azerbaijani, bangla, bashkir, belarusian (+lacinka), bosnian, bulgarian, catalan, chinese_simplified, chinese_traditional, croatian, czech, danish, dutch, english, esperanto (+h_sistemo, +x_sistemo), estonian, euskera, filipino, finnish, french, friulian, galician, georgian, german, greek, gujarati, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, irish, italian, japanese (hiragana, katakana, romaji), kannada, kazakh, khmer, korean, kyrgyz, lao, latin, latvian, lithuanian, macedonian, malagasy, malay, malayalam, maltese, marathi, mongolian, myanmar, nepali, norwegian_nynorsk, occitan, oromo, pashto, persian, polish, portuguese (+acentos_e_cedilha), romanian, russian, sanskrit, santali, serbian (+latin), shona, sinhala, slovak, slovenian, spanish, swahili, swedish, swiss_german, tamil, tatar (+crimean, +crimean_cyrillic), telugu, thai, tibetan, turkish, udmurt, ukrainian (+latynka), urdu, uzbek, vietnamese, welsh, xhosa, yiddish, yoruba, zulu

## On-Premise Deployment

### Prerequisites

- Python 3.10+
- FastAPI, Uvicorn, NumPy, DuckDB, PyArrow

### Quick start

```bash
# 1. Clone the game code
git clone https://github.com/EMRD95/keyboardrage
cd keyboardrage

# 2. Download models from Hugging Face
./setup.sh

# 3. Run the semantic neighbors API
cd galaxy/semantic
pip install fastapi uvicorn numpy duckdb pyarrow
python semantic_neighbors_server.py
# API available at http://localhost:8703
```

### API Endpoints

| Endpoint | Description |
|----------|-------------|
| `GET /health` | Server status, available languages, row count |
| `GET /point/{id}` | Get word metadata by global ID |
| `GET /neighbors/{id}?k=10&language=french` | Get nearest neighbors (global or per-language) |
| `GET /search?q=mot&language=french` | Full-text word search |

## Source Code

The KeyboardRage game source code and visualization themes are at:

**[github.com/EMRD95/keyboardrage](https://github.com/EMRD95/keyboardrage)**

## Regeneration

To rebuild these models from scratch:

```bash
# 1. Rebuild embeddings from merged word lists
cd galaxy && ./rebuild_from_merged_words.sh

# 2. Rebuild semantic index
cd semantic && python build_semantic_index.py

# 3. Precompute neighbors
python precompute_neighbors.py --per-language

# 4. Rebuild 3D projection
cd ../3D_galaxy && ./run_umap50_projection_rebuild.sh
```

All rebuild scripts are in the [GitHub repository](https://github.com/EMRD95/keyboardrage/tree/develop/galaxy).

## License

MIT, same as KeyboardRage.
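
## Example: building API request URLs

As a supplement to the API Endpoints table, the following sketch builds request URLs for the `/neighbors/{id}` and `/search` endpoints using only the Python standard library. The base address comes from the quick start; the helper names (`neighbors_url`, `search_url`) are hypothetical, and the response format is not documented here, so the example stops at constructing the URL.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8703"  # address printed by the quick start


def neighbors_url(word_id, k=10, language=None, base_url=BASE_URL):
    """Build a /neighbors/{id} URL; omit `language` for a global query."""
    params = {"k": k}
    if language is not None:
        params["language"] = language
    return f"{base_url}/neighbors/{int(word_id)}?{urlencode(params)}"


def search_url(query, language=None, base_url=BASE_URL):
    """Build a /search URL; urlencode escapes non-ASCII query text."""
    params = {"q": query}
    if language is not None:
        params["language"] = language
    return f"{base_url}/search?{urlencode(params)}"


print(neighbors_url(42, k=10, language="french"))
print(search_url("mot", language="french"))
```

With the server running, either URL can be fetched with `urllib.request.urlopen` or any HTTP client.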