---
language:
- en
- grt
tags:
- embeddings
- bilingual
- garo
- low-resource
license: cc-by-sa-4.0
datasets:
- private
model-index:
- name: GaroVec v1.0
  results: []
---

# GaroVec v1.0 — Hybrid English↔Garo Embeddings

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17083589.svg)](https://doi.org/10.5281/zenodo.17083589)

## Overview

**GaroVec v1.0** is the *first publicly documented Latin-script Garo embedding model*. It combines:

- **FastText embeddings** (English + Garo)
- **Cross-lingual alignment** using Procrustes rotation
- **Frequency-based bilingual dictionary** for high-confidence word translations

This hybrid design provides both **semantic embeddings** and **direct dictionary lookups**, which makes it useful for cross-lingual tasks such as translation support, lexicon building, and low-resource NLP research.

---

## Training

- **Data size**: ~2,500 English ↔ Garo parallel sentences (English sentences generated synthetically with open models, then translated manually into Garo by native speakers)
- **Method**:
  - FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
  - Linear alignment (Procrustes) between the English and Garo vector spaces
  - Frequency-based dictionary extracted from the parallel corpus
- **Artifacts**:
  - `garovec_garo.bin`: Garo FastText embeddings
  - `garovec_english.bin`: English FastText embeddings
  - `garovec_alignment_matrix.npy`: alignment matrix
  - `garovec_model.pkl`: final hybrid model with dictionary + embeddings

---

## Usage

Load the artifacts, map an English word vector into the Garo space, and rank Garo words by cosine similarity:

```python
import pickle

import fasttext
import numpy as np

# Load hybrid model data (dictionary + embeddings)
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load the monolingual embeddings and the alignment matrix
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

# Example: get the nearest Garo words for an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # map the English vector into the Garo space

# Compare against the first 100 words of the Garo vocabulary
garo_words = garo_model.words[:100]
candidates = np.array([garo_model.get_word_vector(w) for w in garo_words])

# Rank the candidates by cosine similarity and print the top 5
norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec)
scores = candidates @ aligned_vec / np.maximum(norms, 1e-12)
for i in np.argsort(-scores)[:5]:
    print(garo_words[i], float(scores[i]))
```

---

## Limitations

- Trained on a **small parallel dataset** (~2.5k sentences), so vocabulary coverage is limited.
- Best suited to **demonstrations, lexicon building, and low-resource NLP exploration**.
- Future versions will incorporate more data and advanced techniques for broader coverage.

---

## License

- **Model weights & code**: CC-BY-SA 4.0
- **Training data**: private, not released

---

## Citation

If you use **GaroVec v1.0**, please cite:

```
MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
```

---

## Acknowledgements

Built by **MWire Labs** with contributions from native Garo speakers. This release is part of the **NEODAC project** (Northeast India Domain-Adapted Corpus).
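
---

## Appendix: Procrustes alignment sketch

The Training section lists a linear (Procrustes) alignment between the English and Garo vector spaces. The sketch below shows how such a rotation is commonly computed from paired word vectors using an SVD. It is an illustration only, not the GaroVec training code: the function name `procrustes_alignment`, the synthetic data, and the row-vector convention (`src @ W`, chosen to match the `vec @ W` call in the Usage example) are assumptions.

```python
import numpy as np


def procrustes_alignment(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of src is paired with
    row i of tgt (for example, an English word and its Garo translation).
    """
    # The optimal rotation comes from the SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic check (hypothetical data, not GaroVec vectors):
# rotate random "source" vectors by a known orthogonal matrix Q,
# then verify that the alignment recovers Q.
rng = np.random.default_rng(0)
dim = 8
src = rng.normal(size=(50, dim))
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
tgt = src @ Q

W_est = procrustes_alignment(src, tgt)
print("max abs error:", np.abs(W_est - Q).max())
```

With real embeddings, `src` and `tgt` would hold the vectors of dictionary word pairs, and the resulting matrix could be saved with `np.save` for later use.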