KeyboardRage Semantic Models
Precomputed multilingual semantic embeddings and neighbor indices for the KeyboardRage typing game's Galaxy visualization.
Overview
This repository contains the trained semantic models that power the 3D semantic word galaxy in KeyboardRage. 2,250,636 words across 108 languages are embedded into a 384-dimensional space using IBM's Granite multilingual embedding model, then projected to 3D via UMAP. Precomputed nearest-neighbor indices enable real-time similarity queries.
Contents
Global embeddings & index (15 GB)
| File | Size | Description |
|---|---|---|
| semantic_embeddings.f32.npy | 3.3 GB | Float32 embeddings for all 2.25M words (384-dim, L2-normalized so that inner product equals cosine similarity) |
| semantic_faiss_hnsw.index | 3.3 GB | FAISS flat inner-product index (exact cosine similarity over normalized vectors) |
| neighbor_ids.npy | 1.7 GB | Precomputed global top-200 neighbor IDs (rows × 200, int64) |
| neighbor_scores.npy | 1.7 GB | Precomputed global top-200 cosine similarity scores (float32) |
| semantic_index_meta.json | ~1 KB | Model metadata (embedding model, dimensions, row count) |
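As an illustration of how the precomputed neighbor arrays might be consumed, the sketch below memory-maps both files and reads the first k entries of one row. It assumes that the row index equals the global word ID and that each row is already sorted by descending score; neither is stated above, and the helper names `load_neighbors` and `top_k` are hypothetical.

```python
import numpy as np


def load_neighbors(ids_path, scores_path):
    """Memory-map both arrays so the ~3.4 GB of neighbor data is not read into RAM."""
    ids = np.load(ids_path, mmap_mode="r")
    scores = np.load(scores_path, mmap_mode="r")
    if ids.shape != scores.shape:
        raise ValueError(f"shape mismatch: {ids.shape} vs {scores.shape}")
    return ids, scores


def top_k(ids, scores, word_id, k=10):
    """Return the first k (neighbor_id, score) pairs stored for word_id.

    Assumption: row index == global word ID, rows sorted by descending score.
    """
    row_ids = np.asarray(ids[word_id, :k])
    row_scores = np.asarray(scores[word_id, :k])
    return [(int(i), float(s)) for i, s in zip(row_ids, row_scores)]
```

A call such as `top_k(*load_neighbors("neighbor_ids.npy", "neighbor_scores.npy"), word_id=0)` only touches the pages of the file that hold the requested row.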
Per-language neighbor indices
108 languages, each with 4 files:
- neighbor_ids_{lang}.npy: precomputed within-language top-200 neighbor IDs
- neighbor_scores_{lang}.npy: cosine similarity scores
- lang_index_{lang}.npy: global-ID → local-ID mapping
- neighbor_meta_{lang}.json: per-language statistics
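The naming scheme above can be turned into a small check that a download is complete for a given language. This is a minimal sketch that assumes all four files sit in a single directory (the actual layout may differ); `language_files` and `missing_language_files` are hypothetical helper names.

```python
from pathlib import Path

# File-name patterns taken from the list above.
PER_LANGUAGE_PATTERNS = {
    "neighbor_ids": "neighbor_ids_{lang}.npy",
    "neighbor_scores": "neighbor_scores_{lang}.npy",
    "lang_index": "lang_index_{lang}.npy",
    "neighbor_meta": "neighbor_meta_{lang}.json",
}


def language_files(root, lang):
    """Map each logical file name to its expected path under root (assumed flat layout)."""
    root = Path(root)
    return {
        name: root / pattern.format(lang=lang)
        for name, pattern in PER_LANGUAGE_PATTERNS.items()
    }


def missing_language_files(root, lang):
    """Return the sorted logical names of the files that are not present on disk."""
    return sorted(
        name for name, path in language_files(root, lang).items() if not path.is_file()
    )
```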
3D projection metadata
| File | Size | Description |
|---|---|---|
| atlas_data.parquet | 88 MB | Word metadata: 3D UMAP coordinates (x, y, z), word, language, definition |
Raw word embeddings (834 MB)
words_emb_merged/{lang}.json: raw embedding vectors per language, used for regeneration workflows.
Model Details
- Embedding model: ibm-granite/granite-embedding-97m-multilingual-r2
- Embedding dimension: 384
- Total words embedded: 2,250,636
- Languages: 108
- Similarity metric: Cosine similarity via normalized inner product
- 3D projection: UMAP (n_components=3, metric='cosine')
- Neighbor count: Top 200 per word (global + per-language)
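To make the similarity metric concrete, the sketch below computes top-k neighbors by brute force: vectors are L2-normalized, after which a plain inner product equals cosine similarity. It illustrates the metric only and is not the repository's precompute script; `l2_normalize` and `top_k_cosine` are hypothetical names.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length (guarding against zero-length rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def top_k_cosine(vectors, query_index, k):
    """Return (indices, scores) of the k rows most similar to row query_index."""
    unit = l2_normalize(np.asarray(vectors, dtype=np.float32))
    scores = unit @ unit[query_index]      # inner product of unit vectors = cosine
    scores[query_index] = -np.inf          # exclude the query word itself
    k = min(k, len(scores) - 1)
    if k < 1:
        raise ValueError("need at least one other vector and k >= 1")
    candidates = np.argpartition(-scores, k - 1)[:k]         # unordered top k
    order = candidates[np.argsort(-scores[candidates])]      # sort best first
    return order, scores[order]
```

For millions of words this exhaustive scan is run once offline, which is why the neighbor tables are shipped precomputed.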
108 Supported Languages
afrikaans, albanian, amharic, arabic (+egypt, +morocco), armenian (+western), azerbaijani, bangla, bashkir, belarusian (+lacinka), bosnian, bulgarian, catalan, chinese_simplified, chinese_traditional, croatian, czech, danish, dutch, english, esperanto (+h_sistemo, +x_sistemo), estonian, euskera, filipino, finnish, french, friulian, galician, georgian, german, greek, gujarati, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, irish, italian, japanese (hiragana, katakana, romaji), kannada, kazakh, khmer, korean, kyrgyz, lao, latin, latvian, lithuanian, macedonian, malagasy, malay, malayalam, maltese, marathi, mongolian, myanmar, nepali, norwegian_nynorsk, occitan, oromo, pashto, persian, polish, portuguese (+acentos_e_cedilha), romanian, russian, sanskrit, santali, serbian (+latin), shona, sinhala, slovak, slovenian, spanish, swahili, swedish, swiss_german, tamil, tatar (+crimean, +crimean_cyrillic), telugu, thai, tibetan, turkish, udmurt, ukrainian (+latynka), urdu, uzbek, vietnamese, welsh, xhosa, yiddish, yoruba, zulu
On-Premise Deployment
Prerequisites
- Python 3.10+
- FastAPI, NumPy, DuckDB, PyArrow
Quick start
# 1. Clone the game code
git clone https://github.com/EMRD95/keyboardrage
cd keyboardrage
# 2. Download models from HuggingFace
./setup.sh
# 3. Run the semantic neighbors API
cd galaxy/semantic
pip install fastapi uvicorn numpy duckdb pyarrow
python semantic_neighbors_server.py
# API available at http://localhost:8703
API Endpoints
| Endpoint | Description |
|---|---|
| GET /health | Server status, available languages, row count |
| GET /point/{id} | Get word metadata by global ID |
| GET /neighbors/{id}?k=10&language=french | Get nearest neighbors (global or per-language) |
| GET /search?q=mot&language=french | Full-text word search |
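A small standard-library client for these endpoints could look like the sketch below. It assumes the server from the quick start is listening on `http://localhost:8703`, that `language` is optional on both `/neighbors` and `/search`, and that responses are JSON; the response schema is not documented here, so the helpers (all hypothetical names) return the parsed payload as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8703"  # port taken from the quick start above


def neighbors_url(word_id, k=10, language=None, base_url=BASE_URL):
    """Build the /neighbors URL; omitting language is assumed to mean a global query."""
    params = {"k": k}
    if language is not None:
        params["language"] = language
    return f"{base_url}/neighbors/{int(word_id)}?{urlencode(params)}"


def search_url(query, language=None, base_url=BASE_URL):
    """Build the /search URL, percent-encoding the query text."""
    params = {"q": query}
    if language is not None:
        params["language"] = language
    return f"{base_url}/search?{urlencode(params)}"


def get_json(url, timeout=5.0):
    """Fetch a URL and parse the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `get_json(f"{BASE_URL}/health")` is a quick way to confirm that the models loaded before issuing neighbor queries.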
Source Code
The KeyboardRage game source code and visualization themes are at: github.com/EMRD95/keyboardrage
Regeneration
To rebuild these models from scratch:
# 1. Rebuild embeddings from merged word lists
cd galaxy && ./rebuild_from_merged_words.sh
# 2. Rebuild semantic index
cd semantic && python build_semantic_index.py
# 3. Precompute neighbors
python precompute_neighbors.py --per-language
# 4. Rebuild 3D projection
cd ../3D_galaxy && ./run_umap50_projection_rebuild.sh
All rebuild scripts are in the GitHub repository.
License
MIT, same as KeyboardRage.