Badnyal committed on
Commit ac24d86 · verified · 1 Parent(s): 55641ec

Upload README.md with huggingface_hub

  1. README.md +205 -0
README.md ADDED
@@ -0,0 +1,205 @@
---
language:
- grt
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- low-resource
- cross-lingual
- garo
- tibeto-burman
- northeast-india
datasets:
- custom
metrics:
- cosine_similarity
library_name: pytorch
pipeline_tag: sentence-similarity
---

# GaroEmbed: Cross-Lingual Sentence Embeddings for Garo

**GaroEmbed** is the first neural sentence embedding model for Garo, a Tibeto-Burman language with roughly 1.2M speakers in Meghalaya, India. It aligns the Garo semantic space with English through contrastive learning, achieving **29.33% Top-1** and **65.33% Top-5** cross-lingual retrieval accuracy.

## Model Description

- **Model Type**: BiLSTM Sentence Encoder with Contrastive Learning
- **Language**: Garo (grt) ↔ English (en)
- **Training Data**: 3,000 Garo-English parallel sentence pairs
- **Base Embeddings**: GaroVec (FastText 300d with char n-grams)
- **Output Dimension**: 384d (aligned with MiniLM)
- **Parameters**: 10.7M
- **Training Time**: ~15 minutes on RTX A4500

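As a rough sanity check on the parameter count, the encoder size can be worked out from the dimensions listed in this README (300d inputs, 512 hidden units, 2 bidirectional layers, 384d output). The helper name `lstm_param_count` is illustrative and not part of the repository; it assumes PyTorch's `nn.LSTM` layout of four gates with two bias vectors per direction.

```python
def lstm_param_count(input_dim, hidden_dim, num_layers, bidirectional=True):
    """Count trainable parameters of an LSTM laid out like torch.nn.LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(num_layers):
        # 4 gates, each with input weights, recurrent weights, and two bias vectors
        per_direction = 4 * hidden_dim * (layer_input + hidden_dim) + 2 * 4 * hidden_dim
        total += directions * per_direction
        # Deeper layers consume the concatenated outputs of both directions
        layer_input = hidden_dim * directions
    return total


lstm_params = lstm_param_count(input_dim=300, hidden_dim=512, num_layers=2)
projection_params = 2 * 512 * 384 + 384  # weights + bias of the linear projection
encoder_params = lstm_params + projection_params
print(f"LSTM: {lstm_params:,}  projection: {projection_params:,}  total: {encoder_params:,}")
```

Under these assumptions the BiLSTM and projection account for about 10.0M parameters. The remainder of the stated 10.7M presumably depends on how the embedding table is counted, which this README does not specify.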
## Performance

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | 29.33% |
| Top-5 Accuracy | 65.33% |
| Top-10 Accuracy | 72.67% |
| Mean Reciprocal Rank | 0.4512 |
| Avg Cosine Similarity | 0.3446 |

**88x improvement** over mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).

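For readers who want to reproduce metrics of this kind, the sketch below shows one common way to compute Top-k accuracy and Mean Reciprocal Rank from a query-by-candidate similarity matrix. It assumes that query `i` is paired with candidate `i`; the exact evaluation protocol and test set behind the table are not described here, and the function name `retrieval_metrics` and the toy scores are illustrative only.

```python
def retrieval_metrics(similarities, ks=(1, 5, 10)):
    """similarities[i][j] is the score between query i and candidate j.

    The correct candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(similarities):
        # Rank = 1 + number of candidates that score strictly higher than the correct one
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"top{k}": sum(1 for r in ranks if r <= k) / n for k in ks}
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    return metrics


# Toy 3x3 similarity matrix (made-up numbers, not model output)
toy_similarities = [
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.8, 0.5, 0.1],  # correct candidate ranked 2nd
    [0.3, 0.7, 0.6],  # correct candidate ranked 2nd
]
toy_metrics = retrieval_metrics(toy_similarities, ks=(1, 2))
print(toy_metrics)
```

With real embeddings, the matrix would come from cosine similarities between the Garo and English sentence vectors of a held-out parallel set.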
## Usage

### Requirements
```bash
pip install torch fasttext-wheel sentence-transformers huggingface-hub scikit-learn
```

### Loading the Model
```python
import numpy as np
import torch
import torch.nn as nn
import fasttext
from huggingface_hub import hf_hub_download

# Download model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="Badnyal/GaroEmbed",
    filename="garoembed_best.pt"
)

# Download GaroVec embeddings (required)
garovec_path = hf_hub_download(
    repo_id="MWirelabs/GaroVec",
    filename="garovec_garo.bin"
)

# Load GaroVec
garo_fasttext = fasttext.load_model(garovec_path)

# Define model architecture (see model_architecture.py in repo)
class GaroEmbed(nn.Module):
    def __init__(self, garo_fasttext_model, embedding_dim=300, hidden_dim=512, output_dim=384, dropout=0.3):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        words = garo_fasttext_model.words
        self.embedding = nn.Embedding(len(words), embedding_dim)
        # Initialize the embedding table from the GaroVec word vectors
        weights = np.stack([garo_fasttext_model.get_word_vector(word) for word in words])
        self.embedding.weight.data.copy_(torch.from_numpy(weights))
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=dropout, batch_first=True)
        self.projection = nn.Linear(hidden_dim * 2, output_dim)
        self.word2idx = {word: idx for idx, word in enumerate(words)}
        self.fasttext_model = garo_fasttext_model

    def tokenize_and_encode(self, sentences):
        batch_indices = []
        batch_lengths = []
        for sentence in sentences:
            # Lowercase and split on whitespace; unknown tokens fall back to index 0
            tokens = sentence.lower().split()
            indices = [self.word2idx.get(token, 0) for token in tokens]
            if len(indices) == 0:
                indices = [0]  # keep empty sentences at length 1 so packing works
            batch_indices.append(indices)
            batch_lengths.append(len(indices))
        return batch_indices, batch_lengths

    def forward(self, sentences):
        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
        max_len = max(batch_lengths)
        device = next(self.parameters()).device
        # Right-pad every sentence with index 0 up to the longest one in the batch
        padded = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
        for i, indices in enumerate(batch_indices):
            padded[i, :len(indices)] = torch.tensor(indices, dtype=torch.long, device=device)
        embedded = self.embedding(padded)
        # Lengths are passed as a plain Python list, which stays on the CPU
        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_lengths, batch_first=True, enforce_sorted=False)
        _, (hidden, _) = self.lstm(packed)
        # hidden[-2] and hidden[-1] are the top layer's final forward and backward states
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        sentence_embedding = self.projection(combined)
        # L2-normalize so that dot products equal cosine similarities
        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
        return sentence_embedding

# Initialize and load weights
model = GaroEmbed(garo_fasttext, output_dim=384)
checkpoint = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Encode Garo sentences
garo_sentences = [
    "Anga namjanika",
    "Rikgiparang kamko suala"
]

with torch.no_grad():
    embeddings = model(garo_sentences)
print(f"Embeddings shape: {embeddings.shape}")  # [2, 384]
```

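Because `tokenize_and_encode` lowercases the input, splits on whitespace, and maps every unknown token to index 0, it can help to check how much of a sentence the vocabulary actually covers before trusting its embedding. The helper below is a minimal sketch of such a check; `oov_report` and the toy vocabulary are hypothetical and not part of the repository.

```python
def oov_report(sentence, vocab):
    """Return the out-of-vocabulary tokens of a sentence and their share of all tokens.

    Mirrors the tokenization used above: lowercase, then whitespace split.
    """
    tokens = sentence.lower().split()
    unknown = [token for token in tokens if token not in vocab]
    rate = len(unknown) / len(tokens) if tokens else 0.0
    return unknown, rate


# Toy vocabulary for illustration only; with the real model, pass model.word2idx
toy_vocab = {"anga": 0, "namjanika": 1}

covered = oov_report("Anga namjanika", toy_vocab)
partial = oov_report("Anga xyz", toy_vocab)
print(covered, partial)
```

A high out-of-vocabulary rate suggests that many tokens share the same fallback vector, so the resulting sentence embedding may be less reliable.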
### Cross-Lingual Retrieval
```python
import torch
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load English encoder (frozen anchor)
english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode Garo and English
garo_texts = ["Anga namjanika", "Garo biapni dokana"]
english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]

# `model` is the GaroEmbed instance from "Loading the Model"
with torch.no_grad():
    garo_embeds = model(garo_texts).cpu().numpy()
english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)

# Compute similarities (rows: Garo sentences, columns: English sentences)
similarities = cosine_similarity(garo_embeds, english_embeds)
print("Garo-English similarities:")
print(similarities)
```

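To turn a row of the similarity matrix into an actual retrieval result, the English candidates can be sorted by score. The sketch below uses a hypothetical helper, `top_k_matches`, and made-up scores rather than real model output.

```python
def top_k_matches(similarity_row, candidates, k=3):
    """Return the k candidates with the highest similarity, best first."""
    order = sorted(range(len(candidates)), key=lambda j: similarity_row[j], reverse=True)
    return [(candidates[j], similarity_row[j]) for j in order[:k]]


english_candidates = ["I feel bad", "About Garo culture", "The weather is nice"]
example_row = [0.62, 0.10, 0.35]  # illustrative scores for one Garo query

best_two = top_k_matches(example_row, english_candidates, k=2)
print(best_two)
```

With the retrieval snippet from this section, each row of `similarities` can be passed in the same way, for example `top_k_matches(similarities[0].tolist(), english_texts)`.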
## Training Details

- **Architecture**: 2-layer BiLSTM (512 hidden units) + Linear projection
- **Loss**: InfoNCE contrastive loss (temperature=0.07)
- **Optimizer**: Adam (lr=2×10⁻⁴)
- **Batch Size**: 32
- **Epochs**: 20
- **Regularization**: Dropout 0.3, frozen GaroVec embeddings
- **English Anchor**: Frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)

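The InfoNCE objective listed above can be illustrated with a small, dependency-free sketch. It shows the one-directional (Garo → English) form over a batch of L2-normalized vectors, where the matching English sentence is the positive and all other English sentences in the batch act as negatives. Whether the actual training used this form or a symmetric variant is not stated here, and a real training loop would use tensor operations instead of Python lists; the function name `info_nce` is illustrative.

```python
import math


def info_nce(garo_vecs, english_vecs, temperature=0.07):
    """Mean InfoNCE loss; garo_vecs[i] and english_vecs[i] form the positive pair."""
    n = len(garo_vecs)
    total = 0.0
    for i in range(n):
        # Scaled similarities between Garo sentence i and every English sentence in the batch
        logits = [
            sum(a * b for a, b in zip(garo_vecs[i], english_vecs[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp for the softmax denominator
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability assigned to the correct English sentence
        total += log_denominator - logits[i]
    return total / n


# Toy unit vectors: perfectly aligned pairs versus swapped pairs
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

The low temperature sharpens the softmax, so correctly aligned pairs yield a loss near zero while misaligned pairs are penalized heavily.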
## Limitations

- Trained on only 3,000 parallel pairs (limited semantic coverage)
- Domain: Daily conversation and cultural topics (lacks technical/literary language)
- Orthography: Latin script only
- Morphology: Does not explicitly model Garo's agglutinative structure
- Evaluation: Limited to retrieval tasks

## Acknowledgments

- Built on [GaroVec](https://huggingface.co/MWirelabs/GaroVec) word embeddings
- English anchor: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Developed at [MWire Labs](https://mwirelabs.com)

## License

MIT License - Free for research and commercial use

## Contact

- **Author**: Badal Nyalang
- **Organization**: MWire Labs
- **Repository**: [https://huggingface.co/Badnyal/GaroEmbed](https://huggingface.co/Badnyal/GaroEmbed)

---

*First neural sentence embedding model for Garo language • Enabling NLP for low-resource Tibeto-Burman languages*