---
language:
- grt
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- low-resource
- cross-lingual
- garo
- tibeto-burman
- northeast-india
datasets:
- custom
metrics:
- cosine_similarity
library_name: pytorch
pipeline_tag: sentence-similarity
---

# GaroEmbed: Cross-Lingual Sentence Embeddings for Garo

**GaroEmbed** is the first neural sentence embedding model for Garo (a Tibeto-Burman language with ~1.2M speakers in Meghalaya, India). It aligns the Garo semantic space with English through contrastive learning, achieving **29.33% Top-1** and **65.33% Top-5** cross-lingual retrieval accuracy.

## Model Description

- **Model Type**: BiLSTM Sentence Encoder with Contrastive Learning
- **Language**: Garo (grt) ↔ English (en)
- **Training Data**: 3,000 Garo-English parallel sentence pairs
- **Base Embeddings**: GaroVec (FastText 300d with char n-grams)
- **Output Dimension**: 384d (aligned with MiniLM)
- **Parameters**: 10.7M
- **Training Time**: ~15 minutes on RTX A4500

## Performance

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | 29.33% |
| Top-5 Accuracy | 65.33% |
| Top-10 Accuracy | 72.67% |
| Mean Reciprocal Rank | 0.4512 |
| Avg Cosine Similarity | 0.3446 |

**88x improvement** over mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).
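
Top-k accuracy counts a test pair as correct when the true English translation appears among the k nearest English sentences; MRR averages the reciprocal rank of the true translation. As a minimal sketch (the helper name `retrieval_metrics` is ours, not part of this repo), both can be computed from a Garo-English similarity matrix whose diagonal holds the correct pairs:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Top-k accuracy and MRR from a similarity matrix.

    sim[i, j] is the similarity between Garo sentence i and English
    sentence j; on a parallel test set the correct translation of
    sentence i is assumed to sit at column i.
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # columns ranked best-first per row
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    metrics = {f"top{k}": float(np.mean(ranks < k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / (ranks + 1)))
    return metrics

# Toy 3x3 similarity matrix: rows 0 and 2 retrieve correctly at rank 1.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])
print(retrieval_metrics(sim))
```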

## Usage

### Requirements
```bash
pip install torch fasttext-wheel sentence-transformers huggingface-hub
```

### Loading the Model
```python
import torch
import torch.nn as nn
import fasttext
from huggingface_hub import hf_hub_download

# Download model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="Badnyal/GaroEmbed",
    filename="garoembed_best.pt"
)

# Download GaroVec embeddings (required)
garovec_path = hf_hub_download(
    repo_id="MWirelabs/GaroVec",
    filename="garovec_garo.bin"
)

# Load GaroVec
garo_fasttext = fasttext.load_model(garovec_path)

# Define model architecture (see model_architecture.py in repo)
class GaroEmbed(nn.Module):
    def __init__(self, garo_fasttext_model, embedding_dim=300, hidden_dim=512, output_dim=384, dropout=0.3):
        super(GaroEmbed, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        vocab_size = len(garo_fasttext_model.words)
        # Initialize the embedding table with pre-trained GaroVec vectors.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        weights = torch.FloatTensor(
            [garo_fasttext_model.get_word_vector(word) for word in garo_fasttext_model.words]
        )
        self.embedding.weight.data.copy_(weights)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=dropout, batch_first=True)
        self.projection = nn.Linear(hidden_dim * 2, output_dim)
        self.word2idx = {word: idx for idx, word in enumerate(garo_fasttext_model.words)}
        self.fasttext_model = garo_fasttext_model

    def tokenize_and_encode(self, sentences):
        batch_indices = []
        batch_lengths = []
        for sentence in sentences:
            tokens = sentence.lower().split()
            # Map OOV tokens to index 0; empty sentences get a single placeholder token.
            indices = [self.word2idx.get(token, 0) for token in tokens]
            if not indices:
                indices = [0]
            batch_indices.append(indices)
            batch_lengths.append(len(indices))
        return batch_indices, batch_lengths

    def forward(self, sentences):
        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
        max_len = max(batch_lengths)
        device = next(self.parameters()).device
        padded = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
        for i, indices in enumerate(batch_indices):
            padded[i, :len(indices)] = torch.LongTensor(indices)
        embedded = self.embedding(padded)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_lengths, batch_first=True, enforce_sorted=False)
        lstm_out, (hidden, cell) = self.lstm(packed)
        # Concatenate the final forward and backward hidden states of the top layer.
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        sentence_embedding = self.projection(combined)
        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
        return sentence_embedding

# Initialize and load weights
model = GaroEmbed(garo_fasttext, output_dim=384)
checkpoint = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Encode Garo sentences
garo_sentences = [
    "Anga namjanika",
    "Rikgiparang kamko suala"
]

with torch.no_grad():
    embeddings = model(garo_sentences)
    print(f"Embeddings shape: {embeddings.shape}")  # [2, 384]
```

### Cross-Lingual Retrieval
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load English encoder (frozen anchor)
english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode Garo and English
garo_texts = ["Anga namjanika", "Garo biapni dokana"]
english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]

with torch.no_grad():
    garo_embeds = model(garo_texts).numpy()
english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)

# Compute similarities
similarities = cosine_similarity(garo_embeds, english_embeds)
print("Garo-English similarities:")
print(similarities)
```
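
To turn a similarity matrix into actual retrieval results, the scores can be ranked per query. The helper below is a minimal sketch (the name `top_k_matches` is ours, not part of this repo) that works on any pair of L2-normalized embedding arrays, such as `garo_embeds` and `english_embeds` above:

```python
import numpy as np

def top_k_matches(query_embeds, corpus_embeds, corpus_texts, k=3):
    """Return the k most similar corpus sentences as (text, score) per query.

    Both embedding arrays are assumed L2-normalized, so the dot
    product equals cosine similarity.
    """
    sims = np.asarray(query_embeds) @ np.asarray(corpus_embeds).T
    top = np.argsort(-sims, axis=1)[:, :k]  # best matches first
    return [[(corpus_texts[j], float(sims[i, j])) for j in row]
            for i, row in enumerate(top)]

# Toy example with 2-d unit vectors standing in for real embeddings.
corpus = np.array([[0.0, 1.0], [1.0, 0.0]])
texts = ["The weather is nice", "I feel bad"]
query = np.array([[1.0, 0.0]])
print(top_k_matches(query, corpus, texts, k=1))
```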

## Training Details

- **Architecture**: 2-layer BiLSTM (512 hidden units) + Linear projection
- **Loss**: InfoNCE contrastive loss (temperature=0.07)
- **Optimizer**: Adam (lr=2×10⁻⁴)
- **Batch Size**: 32
- **Epochs**: 20
- **Regularization**: Dropout 0.3, frozen GaroVec embeddings
- **English Anchor**: Frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)
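
The InfoNCE objective treats each Garo-English pair in a batch as a positive and every other in-batch pairing as a negative. The sketch below is our illustrative reconstruction, not the repo's training code; the actual loss may differ in details such as whether it is symmetric:

```python
import torch
import torch.nn.functional as F

def info_nce(garo_embeds, english_embeds, temperature=0.07):
    """Symmetric InfoNCE over a batch of parallel sentence pairs.

    Row i of each tensor is assumed to be an L2-normalized embedding
    of the i-th pair, so the diagonal of the similarity matrix holds
    the positives and off-diagonal entries act as in-batch negatives.
    """
    logits = garo_embeds @ english_embeds.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_g2e = F.cross_entropy(logits, targets)    # retrieve English given Garo
    loss_e2g = F.cross_entropy(logits.T, targets)  # retrieve Garo given English
    return (loss_g2e + loss_e2g) / 2
```

With perfectly aligned, well-separated embeddings the loss approaches zero; misaligned pairs push it up, which is what pulls the Garo encoder toward the frozen MiniLM anchor space.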

## Limitations

- Trained on only 3,000 parallel pairs (limited semantic coverage)
- Domain: Daily conversation and cultural topics (lacks technical/literary language)
- Orthography: Latin script only
- Morphology: Does not explicitly model Garo's agglutinative structure
- Evaluation: Limited to retrieval tasks

## Acknowledgments

- Built on [GaroVec](https://huggingface.co/MWirelabs/GaroVec) word embeddings
- English anchor: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Developed at [MWire Labs](https://mwirelabs.com)

## License

MIT License - Free for research and commercial use

## Contact

- **Author**: Badal Nyalang
- **Organization**: MWire Labs
- **Repository**: [https://huggingface.co/Badnyal/GaroEmbed](https://huggingface.co/Badnyal/GaroEmbed)

---

*First neural sentence embedding model for Garo language • Enabling NLP for low-resource Tibeto-Burman languages*