+---
+language:
+- grt
+- en
+license: mit
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- low-resource
+- cross-lingual
+- garo
+- tibeto-burman
+- northeast-india
+datasets:
+- custom
+metrics:
+- cosine_similarity
+library_name: pytorch
+pipeline_tag: sentence-similarity
+---
+# GaroEmbed: Cross-Lingual Sentence Embeddings for Garo
+**GaroEmbed** is the first neural sentence embedding model for Garo, a Tibeto-Burman language with roughly 1.2M speakers in Meghalaya, India. It aligns the Garo semantic space with English through contrastive learning and reaches **29.33% Top-1** and **65.33% Top-5** cross-lingual retrieval accuracy.
+## Model Description
+- **Model Type**: BiLSTM Sentence Encoder with Contrastive Learning
+- **Language**: Garo (grt) ↔ English (en)
+- **Training Data**: 3,000 Garo-English parallel sentence pairs
+- **Base Embeddings**: GaroVec (FastText 300d with char n-grams)
+- **Output Dimension**: 384d (matches the all-MiniLM-L6-v2 English encoder)
+- **Parameters**: 10.7M
+- **Training Time**: ~15 minutes on RTX A4500
+## Performance
+| Metric | Score |
+|--------|-------|
+| Top-1 Accuracy | 29.33% |
+| Top-5 Accuracy | 65.33% |
+| Top-10 Accuracy | 72.67% |
+| Mean Reciprocal Rank | 0.4512 |
+| Avg Cosine Similarity | 0.3446 |
+**88x improvement** over mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).
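For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Top-k accuracy and Mean Reciprocal Rank are commonly computed from a query-by-candidate similarity matrix. It assumes the correct English candidate for Garo query `i` sits at column `i`; the function name `retrieval_metrics` and the toy scores are illustrative and are not taken from the model's evaluation code.

```python
def retrieval_metrics(similarities, ks=(1, 5, 10)):
    """Top-k accuracy and MRR for a square query-by-candidate similarity matrix.

    Assumes the correct candidate for query i is candidate i.
    """
    n = len(similarities)
    hits = {k: 0 for k in ks}
    reciprocal_rank_sum = 0.0
    for i, row in enumerate(similarities):
        # Rank of the correct candidate = 1 + number of candidates scoring strictly higher
        rank = 1 + sum(1 for score in row if score > row[i])
        reciprocal_rank_sum += 1.0 / rank
        for k in ks:
            if rank <= k:
                hits[k] += 1
    top_k = {k: hits[k] / n for k in ks}
    return top_k, reciprocal_rank_sum / n


# Toy 3x3 example with made-up scores: rows are Garo queries, columns are English candidates
toy = [
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.6, 0.5, 0.1],  # correct candidate ranked 2nd
    [0.3, 0.4, 0.2],  # correct candidate ranked 3rd
]
top_k, mrr = retrieval_metrics(toy, ks=(1, 2))
print(top_k, round(mrr, 4))
```

With real embeddings, `similarities` would be the cosine-similarity matrix between the Garo and English test sentences, converted to nested lists or iterated row by row.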
+## Usage
+### Requirements
+```bash
+pip install torch fasttext-wheel sentence-transformers huggingface-hub scikit-learn
+```
+### Loading the Model
+```python
+import torch
+import torch.nn as nn
+import fasttext
+from huggingface_hub import hf_hub_download
+# Download model checkpoint
+checkpoint_path = hf_hub_download(
+    repo_id="Badnyal/GaroEmbed",
+    filename="garoembed_best.pt"
+)
+# Download GaroVec embeddings (required)
+garovec_path = hf_hub_download(
+    repo_id="MWirelabs/GaroVec",
+    filename="garovec_garo.bin"
+)
+# Load GaroVec
+garo_fasttext = fasttext.load_model(garovec_path)
+# Define model architecture (see model_architecture.py in repo)
+class GaroEmbed(nn.Module):
+    def __init__(self, garo_fasttext_model, embedding_dim=300, hidden_dim=512, output_dim=384, dropout=0.3):
+        super(GaroEmbed, self).__init__()
+        self.embedding_dim = embedding_dim
+        self.hidden_dim = hidden_dim
+        self.output_dim = output_dim
+        vocab_size = len(garo_fasttext_model.words)
+        self.embedding = nn.Embedding(vocab_size, embedding_dim)
+        # Copy the pretrained GaroVec vectors into the embedding table, one row per vocabulary word
+        weights_tensor = torch.stack([
+            torch.from_numpy(garo_fasttext_model.get_word_vector(word))
+            for word in garo_fasttext_model.words
+        ])
+        self.embedding.weight.data.copy_(weights_tensor)
+        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=dropout, batch_first=True)
+        self.projection = nn.Linear(hidden_dim * 2, output_dim)
+        self.word2idx = {word: idx for idx, word in enumerate(garo_fasttext_model.words)}
+        self.fasttext_model = garo_fasttext_model
+    def tokenize_and_encode(self, sentences):
+        batch_indices = []
+        batch_lengths = []
+        for sentence in sentences:
+            tokens = sentence.lower().split()
+            indices = []
+            for token in tokens:
+                if token in self.word2idx:
+                    indices.append(self.word2idx[token])
+                else:
+                    # Out-of-vocabulary tokens fall back to index 0, the same index used for padding
+                    indices.append(0)
+            if len(indices) == 0:
+                indices = [0]
+            batch_indices.append(indices)
+            batch_lengths.append(len(indices))
+        return batch_indices, batch_lengths
+    def forward(self, sentences):
+        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
+        max_len = max(batch_lengths)
+        device = next(self.parameters()).device
+        padded = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
+        for i, indices in enumerate(batch_indices):
+            padded[i, :len(indices)] = torch.tensor(indices, dtype=torch.long, device=device)
+        embedded = self.embedding(padded)
+        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_lengths, batch_first=True, enforce_sorted=False)
+        lstm_out, (hidden, cell) = self.lstm(packed)
+        forward_hidden = hidden[-2]
+        backward_hidden = hidden[-1]
+        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
+        sentence_embedding = self.projection(combined)
+        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
+        return sentence_embedding
+# Initialize and load weights
+model = GaroEmbed(garo_fasttext, output_dim=384)
+checkpoint = torch.load(checkpoint_path, map_location='cpu')
+model.load_state_dict(checkpoint['model_state_dict'])
+model.eval()
+# Encode Garo sentences
+garo_sentences = [
+    "Anga namjanika",
+    "Rikgiparang kamko suala"
+]
+with torch.no_grad():
+    embeddings = model(garo_sentences)
+    print(f"Embeddings shape: {embeddings.shape}")  # [2, 384]
+```
+### Cross-Lingual Retrieval
+```python
+from sentence_transformers import SentenceTransformer
+from sklearn.metrics.pairwise import cosine_similarity
+# Load English encoder (frozen anchor)
+english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
+# Encode Garo and English
+garo_texts = ["Anga namjanika", "Garo biapni dokana"]
+english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]
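+# Reuses `torch` and `model` from the loading example; no_grad avoids building the autograd graph
+with torch.no_grad():
+    garo_embeds = model(garo_texts).cpu().numpy()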
+english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)
+# Compute similarities
+similarities = cosine_similarity(garo_embeds, english_embeds)
+print("Garo-English similarities:")
+print(similarities)
+```
+## Training Details
+- **Architecture**: 2-layer BiLSTM (512 hidden units) + Linear projection
+- **Loss**: InfoNCE contrastive loss (temperature=0.07)
+- **Optimizer**: Adam (lr=2×10⁻⁴)
+- **Batch Size**: 32
+- **Epochs**: 20
+- **Regularization**: Dropout 0.3, frozen GaroVec embeddings
+- **English Anchor**: Frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)
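To make the loss above concrete, here is a dependency-free sketch of a one-directional (Garo → English) InfoNCE loss with in-batch negatives. It assumes L2-normalized embeddings where row `i` of each batch forms a matched pair; the function name `info_nce_loss` and the toy vectors are illustrative, and the card does not state whether training used this one-directional form or a symmetric variant.

```python
import math


def info_nce_loss(garo_embeds, english_embeds, temperature=0.07):
    """Mean InfoNCE loss over a batch of L2-normalized embedding pairs.

    Row i of each list is a matched pair; all other rows act as in-batch negatives.
    """
    losses = []
    for i, garo_vec in enumerate(garo_embeds):
        # Temperature-scaled dot products (cosine similarities for unit vectors)
        logits = [
            sum(a * b for a, b in zip(garo_vec, english_vec)) / temperature
            for english_vec in english_embeds
        ]
        # Numerically stable log-sum-exp over all candidates
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the matched pair as the target class
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy check with made-up unit vectors: aligned pairs give a near-zero loss,
# swapped pairs give a large one
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce_loss(aligned, aligned)
loss_swapped = info_nce_loss(aligned, aligned[::-1])
print(loss_matched, loss_swapped)
```

The low temperature (0.07) sharpens the softmax, so even small similarity gaps between the matched pair and the negatives produce a strong training signal.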
+## Limitations
+- Trained on only 3,000 parallel pairs (limited semantic coverage)
+- Domain: Daily conversation and cultural topics (lacks technical/literary language)
+- Orthography: Latin script only
+- Morphology: Does not explicitly model Garo's agglutinative structure
+- Evaluation: Limited to retrieval tasks
+## Acknowledgments
+- Built on [GaroVec](https://huggingface.co/MWirelabs/GaroVec) word embeddings
+- English anchor: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
+- Developed at [MWire Labs](https://mwirelabs.com)
+## License
+MIT License - Free for research and commercial use
+## Contact
+- **Author**: Badal Nyalang
+- **Organization**: MWire Labs
+- **Repository**: [https://huggingface.co/Badnyal/GaroEmbed](https://huggingface.co/Badnyal/GaroEmbed)
+---
+*First neural sentence embedding model for the Garo language • Enabling NLP for low-resource Tibeto-Burman languages*