	---
	language:
	- af
	- am
	- ar
	- hy
	- az
	- bn
	- ba
	- be
	- bs
	- bg
	- ca
	- zh
	- hr
	- cs
	- da
	- nl
	- en
	- eo
	- et
	- eu
	- fil
	- fi
	- fr
	- fur
	- gl
	- ka
	- de
	- el
	- gu
	- ha
	- haw
	- he
	- hi
	- hu
	- is
	- id
	- ga
	- it
	- ja
	- kn
	- kk
	- km
	- ko
	- ky
	- lo
	- la
	- lv
	- lt
	- mk
	- mg
	- ms
	- ml
	- mt
	- mr
	- mn
	- my
	- ne
	- nn
	- oc
	- om
	- ps
	- fa
	- pl
	- pt
	- ro
	- ru
	- sa
	- sat
	- sr
	- sn
	- si
	- sk
	- sl
	- es
	- sw
	- sv
	- gsw
	- ta
	- tt
	- te
	- th
	- bo
	- tr
	- udm
	- uk
	- ur
	- uz
	- vi
	- cy
	- xh
	- yi
	- yo
	- zu
	tags:
	- keyboardrage
	- semantic-search
	- embeddings
	- typing-game
	- multilingual
	- faiss
	- granite-embedding
	license: mit
	datasets:
	- wiktionary
	- monkeytype
	---

	# KeyboardRage Semantic Models

	Precomputed multilingual semantic embeddings and neighbor indices for the [KeyboardRage](https://github.com/EMRD95/keyboardrage) typing game's Galaxy visualization.

	## Overview

	This repository contains the trained semantic models that power the 3D semantic word galaxy in KeyboardRage. 2,250,636 words across 108 languages are embedded into a 384-dimensional space using IBM's Granite multilingual embedding model, then projected to 3D via UMAP. Precomputed nearest-neighbor indices enable real-time similarity queries.

	## Contents

	### Global embeddings & index (15 GB)
	| File | Size | Description |
	|------|------|-------------|
	| `semantic_embeddings.f32.npy` | 3.3 GB | Float32 embeddings for all 2.25M words (384-dim, L2-normalized for inner-product search) |
	| `semantic_faiss_hnsw.index` | 3.3 GB | FAISS flat inner-product index (exact cosine similarity over normalized vectors) |
	| `neighbor_ids.npy` | 1.7 GB | Precomputed global top-200 neighbor IDs (rows × 200, int64) |
	| `neighbor_scores.npy` | 1.7 GB | Precomputed global top-200 cosine similarity scores (float32) |
	| `semantic_index_meta.json` | ~1 KB | Model metadata (embedding model, dimensions, row count) |
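
The precomputed neighbor tables can be read as plain row lookups. The sketch below is a minimal illustration, not code from the repository: it assumes that row `i` of both arrays describes the word with global ID `i` and that each row is already sorted by descending score. The helper name `top_k_neighbors` and the tiny demo arrays are hypothetical stand-ins for the real files.

```python
import numpy as np


def top_k_neighbors(neighbor_ids, neighbor_scores, word_id, k=10):
    """Return the first k (neighbor_id, score) pairs stored for `word_id`.

    ASSUMPTION: row i belongs to global ID i and rows are sorted by
    descending similarity, so the first k columns are the top-k neighbors.
    """
    if not 0 < k <= neighbor_ids.shape[1]:
        raise ValueError("k must be between 1 and the stored neighbor count")
    ids = neighbor_ids[word_id, :k]
    scores = neighbor_scores[word_id, :k]
    return [(int(i), float(s)) for i, s in zip(ids, scores)]


# Synthetic stand-ins for neighbor_ids.npy / neighbor_scores.npy.
# With the real files, np.load(path, mmap_mode="r") avoids reading
# the whole array into memory.
demo_ids = np.array([[1, 2, 3], [0, 2, 3], [1, 0, 3], [2, 1, 0]], dtype=np.int64)
demo_scores = np.array(
    [[0.5, 0.25, 0.125], [0.5, 0.25, 0.125], [0.5, 0.25, 0.125], [0.5, 0.25, 0.125]],
    dtype=np.float32,
)

result = top_k_neighbors(demo_ids, demo_scores, word_id=0, k=2)
```

Memory-mapping keeps a lookup cheap because only the requested row is read from disk.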

	### Per-language neighbor indices
	108 languages, each with 4 files:
	- `neighbor_ids_{lang}.npy` — precomputed within-language top-200 neighbor IDs
	- `neighbor_scores_{lang}.npy` — cosine similarity scores
	- `lang_index_{lang}.npy` — global-ID → local-ID mapping
	- `neighbor_meta_{lang}.json` — per-language statistics
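
The exact array layout of the per-language files is not documented here, so the following is only a sketch under stated assumptions: `lang_index_{lang}.npy` is taken to be a 1-D array of global IDs sorted ascending, where the position of an entry is its local ID, and `neighbor_ids_{lang}.npy` is taken to be indexed by local ID while storing global IDs. Both assumptions and both helper names are hypothetical and should be checked against the real files.

```python
import numpy as np


def global_to_local(lang_index, global_id):
    """Map a global word ID to its local row, or None if it is not in the language.

    ASSUMPTION: lang_index is sorted ascending; position = local ID, value = global ID.
    """
    pos = int(np.searchsorted(lang_index, global_id))
    if pos < len(lang_index) and lang_index[pos] == global_id:
        return pos
    return None


def language_neighbors(lang_index, lang_neighbor_ids, global_id, k=5):
    """Return up to k within-language neighbor IDs for a global word ID.

    ASSUMPTION: lang_neighbor_ids is indexed by local ID and stores global IDs.
    """
    local = global_to_local(lang_index, global_id)
    if local is None:
        return []
    return [int(i) for i in lang_neighbor_ids[local, :k]]


# Synthetic stand-ins for lang_index_{lang}.npy and neighbor_ids_{lang}.npy.
demo_lang_index = np.array([3, 7, 12, 20], dtype=np.int64)
demo_lang_neighbors = np.array(
    [[7, 12, 20], [3, 12, 20], [7, 3, 20], [12, 7, 3]], dtype=np.int64
)

found = language_neighbors(demo_lang_index, demo_lang_neighbors, global_id=12, k=2)
missing = language_neighbors(demo_lang_index, demo_lang_neighbors, global_id=5, k=2)
```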

	### 3D projection metadata
	| File | Size | Description |
	|------|------|-------------|
	| `atlas_data.parquet` | 88 MB | Word metadata: 3D UMAP coordinates (x, y, z), word, language, definition |

	### Raw word embeddings (834 MB)
	`words_emb_merged/{lang}.json` — raw embedding vectors per language, used for regeneration workflows.

	## Model Details

	- Embedding model: [ibm-granite/granite-embedding-97m-multilingual-r2](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2)
	- Embedding dimension: 384
	- Total words embedded: 2,250,636
	- Languages: 108
	- Similarity metric: Cosine similarity via normalized inner product
	- 3D projection: UMAP (n_components=3, metric='cosine')
	- Neighbor count: Top 200 per word (global + per-language)
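
To illustrate why a normalized inner product serves as cosine similarity, here is a small self-contained check using two made-up vectors. The helper `l2_normalize` is a hypothetical name for this example and is not taken from the repository.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit L2 norm (guarding against zero-length rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0]])

# Cosine similarity computed from its definition: (a . b) / (|a| * |b|).
cosine = float((a @ b.T)[0, 0] / (np.linalg.norm(a) * np.linalg.norm(b)))

# Plain inner product after normalizing both vectors once.
inner = float((l2_normalize(a) @ l2_normalize(b).T)[0, 0])
```

Both values agree (24/25 here), which is why embeddings that are normalized once at build time can be compared with an inner product alone.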

	## 108 Supported Languages

	afrikaans, albanian, amharic, arabic (+egypt, +morocco), armenian (+western), azerbaijani, bangla, bashkir, belarusian (+lacinka), bosnian, bulgarian, catalan, chinese_simplified, chinese_traditional, croatian, czech, danish, dutch, english, esperanto (+h_sistemo, +x_sistemo), estonian, euskera, filipino, finnish, french, friulian, galician, georgian, german, greek, gujarati, hausa, hawaiian, hebrew, hindi, hungarian, icelandic, indonesian, irish, italian, japanese (hiragana, katakana, romaji), kannada, kazakh, khmer, korean, kyrgyz, lao, latin, latvian, lithuanian, macedonian, malagasy, malay, malayalam, maltese, marathi, mongolian, myanmar, nepali, norwegian_nynorsk, occitan, oromo, pashto, persian, polish, portuguese (+acentos_e_cedilha), romanian, russian, sanskrit, santali, serbian (+latin), shona, sinhala, slovak, slovenian, spanish, swahili, swedish, swiss_german, tamil, tatar (+crimean, +crimean_cyrillic), telugu, thai, tibetan, turkish, udmurt, ukrainian (+latynka), urdu, uzbek, vietnamese, welsh, xhosa, yiddish, yoruba, zulu

	## On-Premise Deployment

	### Prerequisites
	- Python 3.10+
	- FastAPI, Uvicorn, NumPy, DuckDB, PyArrow

	### Quick start

	```bash
	# 1. Clone the game code
	git clone https://github.com/EMRD95/keyboardrage
	cd keyboardrage

	# 2. Download models from HuggingFace
	./setup.sh

	# 3. Run the semantic neighbors API
	cd galaxy/semantic
	pip install fastapi uvicorn numpy duckdb pyarrow
	python semantic_neighbors_server.py
	# API available at http://localhost:8703
	```

	### API Endpoints

	| Endpoint | Description |
	|----------|-------------|
	| `GET /health` | Server status, available languages, row count |
	| `GET /point/{id}` | Get word metadata by global ID |
	| `GET /neighbors/{id}?k=10&language=french` | Get nearest neighbors (global or per-language) |
	| `GET /search?q=mot&language=french` | Full-text word search |
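
A minimal client sketch for the endpoints above, using only the Python standard library. The paths and query parameters come from the table. The helper names are hypothetical, and the shape of the JSON responses is not documented here, so `fetch_json` simply returns whatever the server sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8703"  # default address from the quick start


def neighbors_url(word_id, k=10, language=None, base_url=BASE_URL):
    """Build the /neighbors URL; omit `language` to query globally."""
    params = {"k": k}
    if language is not None:
        params["language"] = language
    return f"{base_url}/neighbors/{int(word_id)}?{urlencode(params)}"


def search_url(query, language=None, base_url=BASE_URL):
    """Build the /search URL with a percent-encoded query string."""
    params = {"q": query}
    if language is not None:
        params["language"] = language
    return f"{base_url}/search?{urlencode(params)}"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON body (requires the server to be running)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


example_neighbors = neighbors_url(42, k=10, language="french")
example_search = search_url("mot", language="french")
```

With the server running, `fetch_json(f"{BASE_URL}/health")` is a quick way to confirm that it is up before issuing other requests.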

	## Source Code

	The KeyboardRage game source code and visualization themes are at:
	[github.com/EMRD95/keyboardrage](https://github.com/EMRD95/keyboardrage)

	## Regeneration

	To rebuild these models from scratch:

	```bash
	# 1. Rebuild embeddings from merged word lists
	cd galaxy && ./rebuild_from_merged_words.sh

	# 2. Rebuild semantic index
	cd semantic && python build_semantic_index.py

	# 3. Precompute neighbors
	python precompute_neighbors.py --per-language

	# 4. Rebuild 3D projection
	cd ../3D_galaxy && ./run_umap50_projection_rebuild.sh
	```

	All rebuild scripts are in the [GitHub repository](https://github.com/EMRD95/keyboardrage/tree/develop/galaxy).
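
For intuition about the neighbor precomputation step, the sketch below shows the underlying idea on a toy matrix: exact top-k neighbors by inner product over normalized vectors. It is not the repository's `precompute_neighbors.py`. The function name is hypothetical, excluding each word from its own neighbor list is an assumption, and a full similarity matrix like this one only fits in memory for small inputs (a multi-million-row run would need batching or an index such as FAISS).

```python
import numpy as np


def brute_force_top_k(embeddings, k):
    """Exact top-k neighbors per row by inner product, excluding the row itself."""
    sims = embeddings @ embeddings.T
    np.fill_diagonal(sims, -np.inf)  # ASSUMPTION: a word is not its own neighbor
    ids = np.argsort(-sims, axis=1)[:, :k]  # highest similarity first
    scores = np.take_along_axis(sims, ids, axis=1)
    return ids.astype(np.int64), scores.astype(np.float32)


# Three unit-length toy vectors standing in for real embeddings.
toy = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
ids, scores = brute_force_top_k(toy, k=2)
```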

	## License

	MIT — same as KeyboardRage.