---
language:
- grt
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- low-resource
- cross-lingual
- garo
- tibeto-burman
- northeast-india
datasets:
- custom
metrics:
- cosine_similarity
library_name: pytorch
pipeline_tag: sentence-similarity
---
# GaroEmbed: Cross-Lingual Sentence Embeddings for Garo
**GaroEmbed** is the first neural sentence embedding model for Garo (Tibeto-Burman language, ~1.2M speakers in Meghalaya, India). It aligns Garo semantic space with English through contrastive learning, achieving **29.33% Top-1** and **65.33% Top-5** cross-lingual retrieval accuracy.
## Model Description
- **Model Type**: BiLSTM Sentence Encoder with Contrastive Learning
- **Language**: Garo (grt) ↔ English (en)
- **Training Data**: 3,000 Garo-English parallel sentence pairs
- **Base Embeddings**: GaroVec (FastText 300d with char n-grams)
- **Output Dimension**: 384d (aligned with MiniLM)
- **Parameters**: 10.7M
- **Training Time**: ~15 minutes on RTX A4500
## Performance
| Metric | Score |
|--------|-------|
| Top-1 Accuracy | 29.33% |
| Top-5 Accuracy | 65.33% |
| Top-10 Accuracy | 72.67% |
| Mean Reciprocal Rank | 0.4512 |
| Avg Cosine Similarity | 0.3446 |
**88x improvement** over mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).
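For reference, the Top-k and MRR figures above can be computed from a Garo→English similarity matrix as follows. This is a minimal NumPy sketch with a synthetic matrix, not the model's actual evaluation code; it assumes gold translation pairs sit on the diagonal.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5)):
    """sim[i, j] = similarity between Garo query i and English candidate j,
    with the gold pair on the diagonal. Returns Top-k accuracies and MRR."""
    n = sim.shape[0]
    # Candidates sorted by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the gold candidate for each query (0 = retrieved first).
    gold_rank = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    topk = {k: float(np.mean(gold_rank < k)) for k in ks}
    mrr = float(np.mean(1.0 / (gold_rank + 1)))
    return topk, mrr

# Toy example: 3 queries, gold pairs on the diagonal.
sim = np.array([
    [0.9, 0.2, 0.1],   # query 0 ranks its gold candidate first
    [0.3, 0.1, 0.6],   # query 1 ranks its gold candidate third
    [0.2, 0.8, 0.7],   # query 2 ranks its gold candidate second
])
topk, mrr = retrieval_metrics(sim, ks=(1, 2))
print(topk, round(mrr, 3))  # {1: 0.333..., 2: 0.666...} 0.611
```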
## Usage
### Requirements
```bash
pip install torch fasttext-wheel sentence-transformers scikit-learn huggingface-hub
```
### Loading the Model
```python
import torch
import torch.nn as nn
import fasttext
from huggingface_hub import hf_hub_download

# Download model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="Badnyal/GaroEmbed",
    filename="garoembed_best.pt"
)

# Download GaroVec embeddings (required)
garovec_path = hf_hub_download(
    repo_id="MWirelabs/GaroVec",
    filename="garovec_garo.bin"
)

# Load GaroVec
garo_fasttext = fasttext.load_model(garovec_path)

# Define model architecture (see model_architecture.py in repo)
class GaroEmbed(nn.Module):
    def __init__(self, garo_fasttext_model, embedding_dim=300,
                 hidden_dim=512, output_dim=384, dropout=0.3):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

        # Initialize the embedding table from pretrained GaroVec vectors
        vocab_size = len(garo_fasttext_model.words)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        weights = [garo_fasttext_model.get_word_vector(word)
                   for word in garo_fasttext_model.words]
        self.embedding.weight.data.copy_(torch.FloatTensor(weights))

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
                            bidirectional=True, dropout=dropout, batch_first=True)
        self.projection = nn.Linear(hidden_dim * 2, output_dim)
        self.word2idx = {word: idx for idx, word in enumerate(garo_fasttext_model.words)}
        self.fasttext_model = garo_fasttext_model

    def tokenize_and_encode(self, sentences):
        batch_indices = []
        batch_lengths = []
        for sentence in sentences:
            tokens = sentence.lower().split()
            # Map out-of-vocabulary tokens to index 0
            indices = [self.word2idx.get(token, 0) for token in tokens]
            if len(indices) == 0:
                indices = [0]
            batch_indices.append(indices)
            batch_lengths.append(len(indices))
        return batch_indices, batch_lengths

    def forward(self, sentences):
        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
        max_len = max(batch_lengths)
        device = next(self.parameters()).device
        padded = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
        for i, indices in enumerate(batch_indices):
            padded[i, :len(indices)] = torch.LongTensor(indices)
        embedded = self.embedding(padded)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_lengths,
                                                   batch_first=True, enforce_sorted=False)
        lstm_out, (hidden, cell) = self.lstm(packed)
        # Concatenate the last layer's forward and backward hidden states
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        sentence_embedding = self.projection(combined)
        # L2-normalize so dot products equal cosine similarities
        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
        return sentence_embedding

# Initialize and load weights
model = GaroEmbed(garo_fasttext, output_dim=384)
checkpoint = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Encode Garo sentences
garo_sentences = [
    "Anga namjanika",
    "Rikgiparang kamko suala"
]
with torch.no_grad():
    embeddings = model(garo_sentences)
print(f"Embeddings shape: {embeddings.shape}")  # [2, 384]
```
### Cross-Lingual Retrieval
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load English encoder (frozen anchor)
english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Encode Garo and English
garo_texts = ["Anga namjanika", "Garo biapni dokana"]
english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]
garo_embeds = model(garo_texts).detach().numpy()
english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)
# Compute similarities
similarities = cosine_similarity(garo_embeds, english_embeds)
print("Garo-English similarities:")
print(similarities)
```
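To turn the similarity matrix into retrieval results, pick the highest-scoring English candidate for each Garo query. A standalone sketch with a made-up similarity matrix (in practice the scores come from `cosine_similarity` as shown above):

```python
import numpy as np

english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]

# Example similarity matrix: rows = Garo queries, cols = English candidates.
similarities = np.array([
    [0.41, 0.12, 0.08],
    [0.10, 0.35, 0.22],
])

# Best-matching English sentence per Garo query.
best = similarities.argmax(axis=1)
for i, j in enumerate(best):
    print(f"Garo query {i} -> {english_texts[j]} (score {similarities[i, j]:.2f})")
```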
## Training Details
- **Architecture**: 2-layer BiLSTM (512 hidden units) + Linear projection
- **Loss**: InfoNCE contrastive loss (temperature=0.07)
- **Optimizer**: Adam (lr=2×10⁻⁴)
- **Batch Size**: 32
- **Epochs**: 20
- **Regularization**: Dropout 0.3, frozen GaroVec embeddings
- **English Anchor**: Frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)
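The InfoNCE objective above can be sketched as follows. This is a minimal illustrative implementation, not the repository's training code: each Garo-English pair in a batch is a positive, and every other pairing in the batch serves as an in-batch negative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(garo_embeds, english_embeds, temperature=0.07):
    """Symmetric InfoNCE: row i of each matrix is one Garo-English pair."""
    # Embeddings are assumed L2-normalized, so the matmul yields cosines.
    logits = garo_embeds @ english_embeds.t() / temperature
    targets = torch.arange(logits.size(0))
    # Average the Garo->English and English->Garo directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy batch of 4 random (unaligned) 384-d pairs.
torch.manual_seed(0)
g = F.normalize(torch.randn(4, 384), dim=1)
e = F.normalize(torch.randn(4, 384), dim=1)
loss = info_nce_loss(g, e)
print(loss.item())  # large for random pairs, near zero for aligned ones
```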
## Limitations
- Trained on only 3,000 parallel pairs (limited semantic coverage)
- Domain: Daily conversation and cultural topics (lacks technical/literary language)
- Orthography: Latin script only
- Morphology: Does not explicitly model Garo's agglutinative structure
- Evaluation: Limited to retrieval tasks
## Acknowledgments
- Built on [GaroVec](https://huggingface.co/MWirelabs/GaroVec) word embeddings
- English anchor: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Developed at [MWire Labs](https://mwirelabs.com)
## License
MIT License - Free for research and commercial use
## Contact
- **Author**: Badal Nyalang
- **Organization**: MWire Labs
- **Repository**: [https://huggingface.co/Badnyal/GaroEmbed](https://huggingface.co/Badnyal/GaroEmbed)
---
*First neural sentence embedding model for Garo language • Enabling NLP for low-resource Tibeto-Burman languages*