---
language:
- af
- am
- ar
- hy
- az
- bn
- ba
- be
- bs
- bg
- ca
- zh
- hr
- cs
- da
- nl
- en
- eo
- et
- eu
- fil
- fi
- fr
- fur
- gl
- ka
- de
- el
- gu
- ha
- haw
- he
- hi
- hu
- is
- id
- ga
- it
- ja
- kn
- kk
- km
- ko
- ky
- lo
- la
- lv
- lt
- mk
- mg
- ms
- ml
- mt
- mr
- mn
- my
- ne
- nn
- oc
- om
- ps
- fa
- pl
- pt
- ro
- ru
- sa
- sat
- sr
- sn
- si
- sk
- sl
- es
- sw
- sv
- gsw
- ta
- tt
- te
- th
- bo
- tr
- udm
- uk
- ur
- uz
- vi
- cy
- xh
- yi
- yo
- zu
tags:
- keyboardrage
- semantic-search
- embeddings
- typing-game
- multilingual
- faiss
- granite-embedding
license: mit
datasets:
- wiktionary
- monkeytype
---
# KeyboardRage Semantic Models
Precomputed multilingual semantic embeddings and neighbor indices for the [KeyboardRage](https://github.com/EMRD95/keyboardrage) typing game's Galaxy visualization.
## Overview
This repository contains the semantic models that power the 3D word galaxy in KeyboardRage. It covers 2,250,636 words across 108 languages. Each word is embedded into a 384-dimensional space with IBM's Granite multilingual embedding model, then projected to 3D with UMAP. Precomputed nearest-neighbor indices let the game answer similarity queries in real time without searching the full embedding matrix.
## Contents
### Global embeddings & index (15 GB)
| File | Size | Description |
|------|------|-------------|
| `semantic_embeddings.f32.npy` | 3.3 GB | Float32 embeddings for all 2.25M words (384-dim, L2-normalized so that inner product equals cosine similarity) |
| `semantic_faiss_hnsw.index` | 3.3 GB | FAISS flat inner-product index (exact cosine similarity over normalized vectors) |
| `neighbor_ids.npy` | 1.7 GB | Precomputed global top-200 neighbor IDs (rows × 200, int64) |
| `neighbor_scores.npy` | 1.7 GB | Precomputed global top-200 cosine similarity scores (float32) |
| `semantic_index_meta.json` | ~1 KB | Model metadata (embedding model, dimensions, row count) |
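
The precomputed neighbor arrays can be queried without loading them fully into memory. The following is a minimal sketch, not the project's own loader: it assumes that row `r` of both arrays belongs to the word with global ID `r` and that each row is already sorted from most to least similar. The helper names `open_neighbor_tables` and `top_neighbors` are hypothetical, and the demo uses a tiny synthetic stand-in so it runs without the real files.

```python
import os
import tempfile

import numpy as np


def open_neighbor_tables(directory="."):
    """Memory-map the two global neighbor arrays instead of reading them into RAM."""
    ids = np.load(os.path.join(directory, "neighbor_ids.npy"), mmap_mode="r")
    scores = np.load(os.path.join(directory, "neighbor_scores.npy"), mmap_mode="r")
    return ids, scores


def top_neighbors(ids, scores, word_id, k=10):
    """Return the first k (neighbor_id, score) pairs stored for one word.

    Assumption: row `word_id` holds that word's neighbors, best first.
    """
    row_ids = ids[word_id, :k]
    row_scores = scores[word_id, :k]
    return [(int(i), float(s)) for i, s in zip(row_ids, row_scores)]


# Synthetic stand-in: 4 words with 3 neighbors each (the real files have 200 columns).
with tempfile.TemporaryDirectory() as tmp:
    np.save(
        os.path.join(tmp, "neighbor_ids.npy"),
        np.array([[1, 2, 3], [0, 2, 3], [1, 0, 3], [2, 1, 0]], dtype=np.int64),
    )
    np.save(
        os.path.join(tmp, "neighbor_scores.npy"),
        np.array([[0.9, 0.8, 0.7]] * 4, dtype=np.float32),
    )
    ids, scores = open_neighbor_tables(tmp)
    demo = top_neighbors(ids, scores, word_id=0, k=2)
    del ids, scores  # drop the memory maps before the temp directory is removed

print(demo)
```

With the real files, `open_neighbor_tables` would be pointed at the download directory, and only the rows that are actually read get paged in from disk.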
### Per-language neighbor indices
108 languages, each with 4 files:
- `neighbor_ids_{lang}.npy` — precomputed within-language top-200 neighbor IDs
- `neighbor_scores_{lang}.npy` — cosine similarity scores
- `lang_index_{lang}.npy` — global-ID → local-ID mapping
- `neighbor_meta_{lang}.json` — per-language statistics
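
The four file names for a language follow directly from the patterns above. As a small sketch, the helper below builds them and reports which ones are missing locally. It assumes `{lang}` is the language name as used elsewhere in this README (for example `french`) and that all four files sit in one directory; the function names and the `root` parameter are hypothetical.

```python
from pathlib import Path

# File name patterns as listed in this README.
PATTERNS = {
    "neighbor_ids": "neighbor_ids_{lang}.npy",
    "neighbor_scores": "neighbor_scores_{lang}.npy",
    "lang_index": "lang_index_{lang}.npy",
    "neighbor_meta": "neighbor_meta_{lang}.json",
}


def language_files(lang, root="."):
    """Map each artifact kind to its expected path for one language."""
    return {kind: Path(root) / pattern.format(lang=lang) for kind, pattern in PATTERNS.items()}


def missing_files(lang, root="."):
    """List the expected file names that do not exist under `root`."""
    return sorted(path.name for path in language_files(lang, root).values() if not path.exists())


for kind, path in language_files("french").items():
    print(kind, path.name)
```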
### 3D projection metadata
| File | Size | Description |
|------|------|-------------|
| `atlas_data.parquet` | 88 MB | Word metadata: 3D UMAP coordinates (x, y, z), word, language, definition |
### Raw word embeddings (834 MB)
`words_emb_merged/{lang}.json` — raw embedding vectors per language, used for regeneration workflows.
## Model Details
- **Embedding model**: [ibm-granite/granite-embedding-97m-multilingual-r2](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2)
- **Embedding dimension**: 384
- **Total words embedded**: 2,250,636
- **Languages**: 108
- **Similarity metric**: Cosine similarity via normalized inner product
- **3D projection**: UMAP (n_components=3, metric='cosine')
- **Neighbor count**: Top 200 per word (global + per-language)
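
To illustrate the similarity metric: once vectors are scaled to unit length, a plain inner product gives their cosine similarity, so a top-k search reduces to a matrix-vector product and a sort. The sketch below uses random vectors rather than the real embeddings. The function names are hypothetical, and excluding the query word from its own results is a choice made for this example, not a documented property of the precomputed files.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_by_inner_product(unit_vectors, query_id, k):
    """Brute-force top-k neighbors of one row using inner products."""
    scores = unit_vectors @ unit_vectors[query_id]
    scores[query_id] = -np.inf  # example choice: leave the query itself out
    order = np.argsort(-scores)[:k]
    return order, scores[order]


rng = np.random.default_rng(0)
raw = rng.normal(size=(6, 384)).astype(np.float32)  # stand-in for real embeddings
unit = l2_normalize(raw)

neighbor_ids, neighbor_scores = top_k_by_inner_product(unit, query_id=0, k=3)

# Cosine similarity computed the long way on the unnormalized vectors.
a, b = raw[0], raw[int(neighbor_ids[0])]
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(neighbor_ids, neighbor_scores, cosine)
```

The long-form cosine matches the best inner-product score up to floating-point error, which is why normalized vectors can be searched with an inner-product index.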
## 108 Supported Languages
afrikaans, albanian, amharic, arabic (+egypt, +morocco), armenian (+western), azerbaijani, bangla, bashkir, belarusian (+lacinka), bosnian, bulgarian, catalan, chinese_simplified, chinese_traditional, croatian, czech, danish, dutch, english, esperanto (+h_sistemo, +x_sistemo), estonian, euskera, filipino, finnish, french, friulian, galician, georgian, german, greek, gujarati, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, irish, italian, japanese (hiragana, katakana, romaji), kannada, kazakh, khmer, korean, kyrgyz, lao, latin, latvian, lithuanian, macedonian, malagasy, malay, malayalam, maltese, marathi, mongolian, myanmar, nepali, norwegian_nynorsk, occitan, oromo, pashto, persian, polish, portuguese (+acentos_e_cedilha), romanian, russian, sanskrit, santali, serbian (+latin), shona, sinhala, slovak, slovenian, spanish, swahili, swedish, swiss_german, tamil, tatar (+crimean, +crimean_cyrillic), telugu, thai, tibetan, turkish, udmurt, ukrainian (+latynka), urdu, uzbek, vietnamese, welsh, xhosa, yiddish, yoruba, zulu
## On-Premise Deployment
### Prerequisites
- Python 3.10+
- FastAPI, Uvicorn, NumPy, DuckDB, PyArrow
### Quick start
```bash
# 1. Clone the game code
git clone https://github.com/EMRD95/keyboardrage
cd keyboardrage
# 2. Download models from HuggingFace
./setup.sh
# 3. Run the semantic neighbors API
cd galaxy/semantic
pip install fastapi uvicorn numpy duckdb pyarrow
python semantic_neighbors_server.py
# API available at http://localhost:8703
```
### API Endpoints
| Endpoint | Description |
|----------|-------------|
| `GET /health` | Server status, available languages, row count |
| `GET /point/{id}` | Get word metadata by global ID |
| `GET /neighbors/{id}?k=10&language=french` | Get nearest neighbors (global or per-language) |
| `GET /search?q=mot&language=french` | Full-text word search |
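
A minimal standard-library client for these endpoints might look like the sketch below. It assumes the server is running locally on port 8703 as in the quick start and that responses are JSON; the response fields are not documented here, so `get_json` simply returns the parsed body. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8703"  # address from the quick start


def neighbors_url(word_id, k=10, language=None, base_url=BASE_URL):
    """Build a /neighbors URL; omit `language` for a global query."""
    params = {"k": k}
    if language is not None:
        params["language"] = language
    return f"{base_url}/neighbors/{int(word_id)}?{urlencode(params)}"


def search_url(query, language=None, base_url=BASE_URL):
    """Build a /search URL with the query text percent-encoded."""
    params = {"q": query}
    if language is not None:
        params["language"] = language
    return f"{base_url}/search?{urlencode(params)}"


def get_json(url, timeout=5.0):
    """Fetch a URL and parse the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building URLs needs no running server; get_json(...) does.
print(neighbors_url(42, k=10, language="french"))
print(search_url("mot", language="french"))
```

With the server running, `get_json(f"{BASE_URL}/health")` is a quick way to confirm that the models were loaded before issuing neighbor queries.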
## Source Code
The KeyboardRage game source code and visualization themes are at:
**[github.com/EMRD95/keyboardrage](https://github.com/EMRD95/keyboardrage)**
## Regeneration
To rebuild these models from scratch:
```bash
# 1. Rebuild embeddings from merged word lists
cd galaxy && ./rebuild_from_merged_words.sh
# 2. Rebuild semantic index
cd semantic && python build_semantic_index.py
# 3. Precompute neighbors
python precompute_neighbors.py --per-language
# 4. Rebuild 3D projection
cd ../3D_galaxy && ./run_umap50_projection_rebuild.sh
```
All rebuild scripts are in the [GitHub repository](https://github.com/EMRD95/keyboardrage/tree/develop/galaxy).
## License
MIT — same as KeyboardRage.