KeyboardRage Semantic Models

Precomputed multilingual semantic embeddings and neighbor indices for the KeyboardRage typing game's Galaxy visualization.

Overview

This repository contains the trained semantic models that power the 3D semantic word galaxy in KeyboardRage. 2,250,636 words across 108 languages are embedded into a 384-dimensional space using IBM's Granite multilingual embedding model, then projected to 3D via UMAP. Precomputed nearest-neighbor indices enable real-time similarity queries.

Contents

Global embeddings & index (15 GB)

| File | Size | Description |
|---|---|---|
| semantic_embeddings.f32.npy | 3.3 GB | Float32 embeddings for all 2.25M words (384-dim, normalized so that inner product equals cosine similarity) |
| semantic_faiss_hnsw.index | 3.3 GB | FAISS flat inner-product index (exact cosine similarity over normalized vectors) |
| neighbor_ids.npy | 1.7 GB | Precomputed global top-200 neighbor IDs (rows × 200, int64) |
| neighbor_scores.npy | 1.7 GB | Precomputed global top-200 cosine similarity scores (float32) |
| semantic_index_meta.json | ~1 KB | Model metadata (embedding model, dimensions, row count) |
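
Because the neighbor arrays are large, it can help to memory-map them rather than load them fully. The following is a minimal sketch, not the project's own loader: the function name `top_k_neighbors` is hypothetical, and it assumes that row `i` of both files describes the word with global ID `i` and that each row is already sorted by descending similarity.

```python
import numpy as np


def top_k_neighbors(ids_path, scores_path, word_id, k=10):
    """Return the first k (neighbor_id, score) pairs stored for one word."""
    # mmap_mode="r" maps the file read-only instead of reading it into RAM.
    ids = np.load(ids_path, mmap_mode="r")
    scores = np.load(scores_path, mmap_mode="r")
    if ids.shape != scores.shape:
        raise ValueError("ID and score arrays must have the same shape")
    k = min(k, ids.shape[1])
    # Assumption: rows are stored in descending similarity order,
    # so the first k columns are the k nearest neighbors.
    row_ids = np.asarray(ids[word_id, :k])
    row_scores = np.asarray(scores[word_id, :k])
    return [(int(i), float(s)) for i, s in zip(row_ids, row_scores)]
```

A call such as `top_k_neighbors("neighbor_ids.npy", "neighbor_scores.npy", 12345, k=5)` would then touch only the pages of the files that hold that one row.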

Per-language neighbor indices

108 languages, each with 4 files:

  • neighbor_ids_{lang}.npy: precomputed within-language top-200 neighbor IDs
  • neighbor_scores_{lang}.npy: cosine similarity scores
  • lang_index_{lang}.npy: global-ID → local-ID mapping
  • neighbor_meta_{lang}.json: per-language statistics
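
The exact on-disk layout of the per-language files is not described here, so the sketch below rests on stated assumptions: that `lang_index_{lang}.npy` is a 1-D array indexed by global ID whose value is the local row (or a negative number when the word is not in that language), and that the per-language neighbor arrays are indexed by local row. The function name `language_neighbors` is hypothetical, and the neighbor IDs are returned exactly as stored, since the source does not say whether they are global or local.

```python
import numpy as np


def language_neighbors(lang_index, neighbor_ids, neighbor_scores, global_id, k=10):
    """Look up within-language neighbors for a word given its global ID."""
    # Assumption: lang_index[global_id] is the local row, negative if absent.
    local_id = int(lang_index[global_id])
    if local_id < 0:
        return []
    k = min(k, neighbor_ids.shape[1])
    # Assumption: per-language arrays are indexed by local row.
    ids = np.asarray(neighbor_ids[local_id, :k])
    scores = np.asarray(neighbor_scores[local_id, :k])
    return [(int(i), float(s)) for i, s in zip(ids, scores)]
```

If the real mapping is stored differently (for example as a list of global IDs per local row), only the first two lines of the function body would need to change.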

3D projection metadata

| File | Size | Description |
|---|---|---|
| atlas_data.parquet | 88 MB | Word metadata: 3D UMAP coordinates (x, y, z), word, language, definition |

Raw word embeddings (834 MB)

words_emb_merged/{lang}.json β€” raw embedding vectors per language, used for regeneration workflows.

Model Details

  • Embedding model: ibm-granite/granite-embedding-97m-multilingual-r2
  • Embedding dimension: 384
  • Total words embedded: 2,250,636
  • Languages: 108
  • Similarity metric: Cosine similarity via normalized inner product
  • 3D projection: UMAP (n_components=3, metric='cosine')
  • Neighbor count: Top 200 per word (global + per-language)
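
To make the similarity metric concrete: once vectors are scaled to unit length, their inner product equals their cosine similarity, which is why an inner-product index can serve cosine queries. The NumPy sketch below shows this on toy data with a brute-force search; the helper names are illustrative and this is not the repository's FAISS-based code.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def cosine_top_k(query, matrix, k=3):
    """Indices and scores of the k rows of `matrix` most similar to `query`."""
    q = l2_normalize(query[None, :])[0]
    m = l2_normalize(matrix)
    # Inner product of unit vectors is exactly the cosine similarity.
    scores = m @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

With the published embeddings, which are described as already normalized, the normalization step would be redundant and a plain matrix-vector product would suffice.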

108 Supported Languages

afrikaans, albanian, amharic, arabic (+egypt, +morocco), armenian (+western), azerbaijani, bangla, bashkir, belarusian (+lacinka), bosnian, bulgarian, catalan, chinese_simplified, chinese_traditional, croatian, czech, danish, dutch, english, esperanto (+h_sistemo, +x_sistemo), estonian, euskera, filipino, finnish, french, friulian, galician, georgian, german, greek, gujarati, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, irish, italian, japanese (hiragana, katakana, romaji), kannada, kazakh, khmer, korean, kyrgyz, lao, latin, latvian, lithuanian, macedonian, malagasy, malay, malayalam, maltese, marathi, mongolian, myanmar, nepali, norwegian_nynorsk, occitan, oromo, pashto, persian, polish, portuguese (+acentos_e_cedilha), romanian, russian, sanskrit, santali, serbian (+latin), shona, sinhala, slovak, slovenian, spanish, swahili, swedish, swiss_german, tamil, tatar (+crimean, +crimean_cyrillic), telugu, thai, tibetan, turkish, udmurt, ukrainian (+latynka), urdu, uzbek, vietnamese, welsh, xhosa, yiddish, yoruba, zulu

On-Premise Deployment

Prerequisites

  • Python 3.10+
  • FastAPI, Uvicorn, NumPy, DuckDB, PyArrow

Quick start

# 1. Clone the game code
git clone https://github.com/EMRD95/keyboardrage
cd keyboardrage

# 2. Download models from HuggingFace
./setup.sh

# 3. Run the semantic neighbors API
cd galaxy/semantic
pip install fastapi uvicorn numpy duckdb pyarrow
python semantic_neighbors_server.py
# API available at http://localhost:8703

API Endpoints

| Endpoint | Description |
|---|---|
| GET /health | Server status, available languages, row count |
| GET /point/{id} | Get word metadata by global ID |
| GET /neighbors/{id}?k=10&language=french | Get nearest neighbors (global or per-language) |
| GET /search?q=mot&language=french | Full-text word search |
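
A small client for these endpoints could look like the following sketch, which uses only the Python standard library. The helper names are hypothetical, the base URL is the default port from the quick start, and the shape of the JSON responses is not documented here, so the code simply returns whatever the server sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8703"  # default address from the quick start


def neighbors_url(word_id, k=10, language=None, base_url=BASE_URL):
    """Build the /neighbors URL; omit `language` for a global query."""
    params = {"k": k}
    if language is not None:
        params["language"] = language
    return f"{base_url}/neighbors/{int(word_id)}?{urlencode(params)}"


def search_url(query, language=None, base_url=BASE_URL):
    """Build the /search URL, percent-encoding the query text as UTF-8."""
    params = {"q": query}
    if language is not None:
        params["language"] = language
    return f"{base_url}/search?{urlencode(params)}"


def get_json(url, timeout=10):
    """Fetch a URL and decode the JSON body (response schema not assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `get_json(neighbors_url(42, language="french"))` would fetch the French neighbors of word 42; printing the result is the easiest way to learn the response fields.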

Source Code

The KeyboardRage game source code and visualization themes are at: github.com/EMRD95/keyboardrage

Regeneration

To rebuild these models from scratch:

# 1. Rebuild embeddings from merged word lists
cd galaxy && ./rebuild_from_merged_words.sh

# 2. Rebuild semantic index
cd semantic && python build_semantic_index.py

# 3. Precompute neighbors
python precompute_neighbors.py --per-language

# 4. Rebuild 3D projection
cd ../3D_galaxy && ./run_umap50_projection_rebuild.sh

All rebuild scripts are in the GitHub repository.

License

MIT, same as KeyboardRage.
