---
language:
- en
- grt
tags:
- embeddings
- bilingual
- garo
- low-resource
license: cc-by-sa-4.0
datasets:
- private
model-index:
- name: GaroVec v1.0
  results: []
---

# GaroVec v1.0 — Hybrid English↔Garo Embeddings

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17083589.svg)](https://doi.org/10.5281/zenodo.17083589)

## Overview
**GaroVec v1.0** is the *first publicly documented Latin-script Garo embedding model*.  
It combines:
- **FastText embeddings** (English + Garo)
- **Cross-lingual alignment** using Procrustes rotation
- **Frequency-based bilingual dictionary** for high-confidence word translations

This hybrid design provides both **semantic embeddings** and **direct dictionary lookups**, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.
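
The hybrid idea can be sketched as "dictionary first, embeddings as fallback". This is a minimal illustration only: the internal layout of `garovec_model.pkl` is not described in this card, so the sketch uses plain Python dicts, and the function name `hybrid_translate`, its parameters, and the `garo_word_*` tokens are hypothetical placeholders (not real Garo words).

```python
import numpy as np

def hybrid_translate(word, dictionary, src_vectors, tgt_vectors, W, top_k=1):
    """Return dictionary translations if present, else aligned nearest neighbors."""
    # 1) Direct lookup in the bilingual dictionary
    if word in dictionary:
        return [dictionary[word]]
    # 2) Fallback: map the source vector into the target space and rank by cosine
    if word not in src_vectors:
        return []
    query = np.asarray(src_vectors[word]) @ W
    tgt_words = list(tgt_vectors)
    matrix = np.array([tgt_vectors[w] for w in tgt_words])
    scores = matrix @ query / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-9
    )
    best = np.argsort(-scores)[:top_k]
    return [tgt_words[i] for i in best]

# Toy data (assumed for illustration; placeholders, not real Garo vocabulary)
dictionary = {"water": "garo_word_a"}
src_vectors = {"water": [1.0, 0.0], "fire": [0.1, 0.9]}
tgt_vectors = {"garo_word_a": [1.0, 0.0], "garo_word_b": [0.0, 1.0]}
W = np.eye(2)  # identity stands in for a learned alignment matrix

print(hybrid_translate("water", dictionary, src_vectors, tgt_vectors, W))
print(hybrid_translate("fire", dictionary, src_vectors, tgt_vectors, W))
```

The first call is answered by the dictionary; the second falls through to the embedding search because `fire` has no dictionary entry.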

---

## Training
- **Data size**: ~2,500 English ↔ Garo parallel sentences  
  (English sentences generated synthetically with open models, then manually translated into Garo by native speakers)  
- **Method**:
  - FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
  - Linear alignment (Procrustes) between English and Garo vector spaces
  - Frequency-based dictionary extracted from the parallel corpus
- **Artifacts**:
  - `garovec_garo.bin` — Garo FastText embeddings
  - `garovec_english.bin` — English FastText embeddings
  - `garovec_alignment_matrix.npy` — alignment matrix
  - `garovec_model.pkl` — final hybrid model with dictionary + embeddings
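
For readers unfamiliar with the Procrustes step listed above, the sketch below shows the standard orthogonal Procrustes solution with NumPy. It is not the project's training script: the function name `procrustes_alignment` and the synthetic data are assumptions for illustration, and the row-vector convention (`X @ W`) is chosen to match the `vec @ W` call in the Usage section.

```python
import numpy as np

def procrustes_alignment(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F.

    X: (n, d) source-language vectors for n seed translation pairs
    Y: (n, d) target-language vectors for the same pairs
    """
    # The SVD of the cross-covariance matrix yields the optimal rotation
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: recover a known rotation (no real embeddings involved)
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(50, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
Y = X @ Q

W = procrustes_alignment(X, Y)
print(np.allclose(X @ W, Y))
```

In practice `X` and `Y` would hold the English and Garo vectors of seed word pairs, for example pairs drawn from the bilingual dictionary.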

---

## Usage

```python
import pickle
import fasttext
import numpy as np

# Load hybrid model data
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load embeddings
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

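# Example: find the nearest Garo words for an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # map the English vector into the Garo space

# Compare against the first 100 entries of the Garo vocabulary
garo_words = garo_model.words[:100]
candidates = np.array([garo_model.get_word_vector(w) for w in garo_words])

# Rank candidates by cosine similarity (epsilon guards against zero vectors)
scores = candidates @ aligned_vec / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec) + 1e-9
)
for i in np.argsort(-scores)[:5]:
    print(garo_words[i], round(float(scores[i]), 3))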
```

---

## Limitations
- Trained on a **small parallel dataset** (~2.5k sentences), so vocabulary coverage is limited.  
- Intended for **demonstrations, lexicon building, and low-resource NLP exploration**.  
- Future versions will incorporate more data and more advanced techniques to broaden coverage.

---

## License
- **Model weights & code**: CC-BY-SA 4.0  
- **Training data**: private, not released

---

## Citation
If you use **GaroVec v1.0**, please cite:

```
MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
```

---

## Acknowledgements
Built by **MWire Labs** with contributions from native Garo speakers.  
This release is part of the **NEODAC project** (Northeast India Domain-Adapted Corpus).