---
language:
- grt
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- low-resource
- cross-lingual
- garo
- tibeto-burman
- northeast-india
datasets:
- custom
metrics:
- cosine_similarity
library_name: pytorch
pipeline_tag: sentence-similarity
---

# GaroEmbed: Cross-Lingual Sentence Embeddings for Garo

**GaroEmbed** is the first neural sentence embedding model for Garo, a Tibeto-Burman language with ~1.2M speakers in Meghalaya, India. It aligns the Garo semantic space with English through contrastive learning, achieving **29.33% Top-1** and **65.33% Top-5** cross-lingual retrieval accuracy.

## Model Description

- **Model Type**: BiLSTM sentence encoder trained with contrastive learning
- **Language**: Garo (grt) ↔ English (en)
- **Training Data**: 3,000 Garo-English parallel sentence pairs
- **Base Embeddings**: GaroVec (FastText, 300d with character n-grams)
- **Output Dimension**: 384d (aligned with MiniLM)
- **Parameters**: 10.7M (see the sanity check below)
- **Training Time**: ~15 minutes on an RTX A4500
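
A quick way to confirm the parameter count once the model has been constructed (as under "Loading the Model" below); the exact total depends on the GaroVec vocabulary size, so treat ~10.7M as a reference value rather than a spec:

```python
# Assumes `model` was built as shown in the Usage section below
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected: ~10.7M
```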

## Performance

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | 29.33% |
| Top-5 Accuracy | 65.33% |
| Top-10 Accuracy | 72.67% |
| Mean Reciprocal Rank | 0.4512 |
| Avg Cosine Similarity | 0.3446 |

**88x improvement** over the mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).
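
For reference, here is a minimal sketch of how Top-k accuracy and MRR can be computed from a square Garo-to-English similarity matrix, assuming the gold English translation of Garo sentence i sits in column i (the exact evaluation script is not published with this card):

```python
import numpy as np

def retrieval_metrics(similarities, ks=(1, 5, 10)):
    """Top-k accuracy and MRR from an [n, n] Garo-to-English similarity matrix.

    Assumes the gold English translation of Garo sentence i is at column i.
    """
    n = similarities.shape[0]
    # Rank English candidates for each Garo query, best first
    order = np.argsort(-similarities, axis=1)
    # 0-based rank of the gold column within each row's ranking
    gold_rank = np.argmax(order == np.arange(n)[:, None], axis=1)
    metrics = {f"top{k}": float(np.mean(gold_rank < k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / (gold_rank + 1)))
    return metrics
```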

## Usage

### Requirements

```bash
pip install torch fasttext-wheel sentence-transformers huggingface-hub scikit-learn
```

(scikit-learn is needed for the cross-lingual retrieval example below.)

### Loading the Model

```python
import torch
import torch.nn as nn
import fasttext
from huggingface_hub import hf_hub_download

# Download model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="Badnyal/GaroEmbed",
    filename="garoembed_best.pt"
)

# Download GaroVec embeddings (required)
garovec_path = hf_hub_download(
    repo_id="MWirelabs/GaroVec",
    filename="garovec_garo.bin"
)

# Load GaroVec
garo_fasttext = fasttext.load_model(garovec_path)

# Define model architecture (see model_architecture.py in repo)
class GaroEmbed(nn.Module):
    def __init__(self, garo_fasttext_model, embedding_dim=300, hidden_dim=512, output_dim=384, dropout=0.3):
        super(GaroEmbed, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        # Copy the pretrained GaroVec vectors into the embedding table
        vocab_size = len(garo_fasttext_model.words)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        weights = []
        for word in garo_fasttext_model.words:
            weights.append(garo_fasttext_model.get_word_vector(word))
        weights_tensor = torch.FloatTensor(weights)
        self.embedding.weight.data.copy_(weights_tensor)
        self.embedding.weight.requires_grad = False  # GaroVec embeddings stay frozen (see Training Details)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=dropout, batch_first=True)
        self.projection = nn.Linear(hidden_dim * 2, output_dim)
        self.word2idx = {word: idx for idx, word in enumerate(garo_fasttext_model.words)}
        self.fasttext_model = garo_fasttext_model

    def tokenize_and_encode(self, sentences):
        batch_indices = []
        batch_lengths = []
        for sentence in sentences:
            tokens = sentence.lower().split()
            indices = []
            for token in tokens:
                if token in self.word2idx:
                    indices.append(self.word2idx[token])
                else:
                    # Out-of-vocabulary tokens fall back to index 0
                    indices.append(0)
            if len(indices) == 0:
                indices = [0]
            batch_indices.append(indices)
            batch_lengths.append(len(indices))
        return batch_indices, batch_lengths

    def forward(self, sentences):
        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
        max_len = max(batch_lengths)
        device = next(self.parameters()).device
        padded = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
        for i, indices in enumerate(batch_indices):
            padded[i, :len(indices)] = torch.LongTensor(indices)
        embedded = self.embedding(padded)
        # Pack so the LSTM skips padding positions
        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_lengths, batch_first=True, enforce_sorted=False)
        lstm_out, (hidden, cell) = self.lstm(packed)
        # Last layer's forward and backward hidden states
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        sentence_embedding = self.projection(combined)
        # L2-normalize so dot products equal cosine similarities
        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
        return sentence_embedding

# Initialize and load weights
model = GaroEmbed(garo_fasttext, output_dim=384)
checkpoint = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Encode Garo sentences
garo_sentences = [
    "Anga namjanika",
    "Rikgiparang kamko suala"
]

with torch.no_grad():
    embeddings = model(garo_sentences)
    print(f"Embeddings shape: {embeddings.shape}")  # [2, 384]
```

### Cross-Lingual Retrieval

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load English encoder (frozen anchor)
english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode Garo and English
garo_texts = ["Anga namjanika", "Garo biapni dokana"]
english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]

with torch.no_grad():
    garo_embeds = model(garo_texts).cpu().numpy()
english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)

# Compute similarities
similarities = cosine_similarity(garo_embeds, english_embeds)
print("Garo-English similarities:")
print(similarities)
```
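
Each row of `similarities` scores one Garo sentence against every English candidate, so the retrieved translation is simply the highest-scoring column:

```python
# Pick the best English match for each Garo sentence
best = similarities.argmax(axis=1)
for garo, idx in zip(garo_texts, best):
    print(f"{garo!r} -> {english_texts[idx]!r}")
```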

## Training Details

- **Architecture**: 2-layer BiLSTM (512 hidden units) + linear projection
- **Loss**: InfoNCE contrastive loss (temperature = 0.07; sketched below)
- **Optimizer**: Adam (lr = 2×10⁻⁴)
- **Batch Size**: 32
- **Epochs**: 20
- **Regularization**: Dropout 0.3, frozen GaroVec embeddings
- **English Anchor**: Frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)
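
The loss treats each Garo sentence's English translation as its positive and every other English sentence in the batch as a negative. A minimal sketch under those assumptions (the training loop itself is not published with this card, and whether a symmetric two-direction loss was used is not stated):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(garo_embeds, english_embeds, temperature=0.07):
    """InfoNCE over a batch of L2-normalized Garo/English pairs.

    Row i of each tensor is a translation pair; all other rows serve
    as in-batch negatives.
    """
    # Scaled cosine similarities: [batch, batch]
    logits = garo_embeds @ english_embeds.T / temperature
    # The matching translation sits on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy check with random normalized vectors
g = F.normalize(torch.randn(32, 384), dim=1)
e = F.normalize(torch.randn(32, 384), dim=1)
print(info_nce_loss(g, e).item())
```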

## Limitations

- Trained on only 3,000 parallel pairs (limited semantic coverage)
- Domain: daily conversation and cultural topics (lacks technical/literary language)
- Orthography: Latin script only
- Morphology: does not explicitly model Garo's agglutinative structure
- Evaluation: limited to retrieval tasks

## Acknowledgments

- Built on [GaroVec](https://huggingface.co/MWirelabs/GaroVec) word embeddings
- English anchor: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Developed at [MWire Labs](https://mwirelabs.com)

## License

MIT License - free for research and commercial use.

## Contact

- **Author**: Badal Nyalang
- **Organization**: MWire Labs
- **Repository**: [https://huggingface.co/Badnyal/GaroEmbed](https://huggingface.co/Badnyal/GaroEmbed)

---

*First neural sentence embedding model for the Garo language • Enabling NLP for low-resource Tibeto-Burman languages*