---
language:
- grt
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- low-resource
- cross-lingual
- garo
- tibeto-burman
- northeast-india
datasets:
- custom
metrics:
- cosine_similarity
library_name: pytorch
pipeline_tag: sentence-similarity
---

# GaroEmbed: Cross-Lingual Sentence Embeddings for Garo

**GaroEmbed** is the first neural sentence embedding model for Garo, a Tibeto-Burman language with ~1.2M speakers in Meghalaya, India. It aligns the Garo semantic space with English through contrastive learning, achieving **29.33% Top-1** and **65.33% Top-5** cross-lingual retrieval accuracy.

## Model Description

- **Model Type**: BiLSTM sentence encoder trained with contrastive learning
- **Language**: Garo (grt) ↔ English (en)
- **Training Data**: 3,000 Garo-English parallel sentence pairs
- **Base Embeddings**: GaroVec (FastText, 300d with character n-grams)
- **Output Dimension**: 384d (aligned with MiniLM)
- **Parameters**: 10.7M (see the sanity check below)
- **Training Time**: ~15 minutes on an RTX A4500
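
A quick way to confirm the parameter count once the model has been constructed (as under "Loading the Model" below); the exact total depends on the GaroVec vocabulary size, so treat ~10.7M as a reference value rather than a spec:

```python
# Assumes `model` was built as shown in the Usage section below
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected: ~10.7M
```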

## Performance

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | 29.33% |
| Top-5 Accuracy | 65.33% |
| Top-10 Accuracy | 72.67% |
| Mean Reciprocal Rank | 0.4512 |
| Avg Cosine Similarity | 0.3446 |

**88x improvement** over the mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).
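
For reference, here is a minimal sketch of how Top-k accuracy and MRR can be computed from a square Garo-to-English similarity matrix, assuming the gold English translation of Garo sentence i sits in column i (the exact evaluation script is not published with this card):

```python
import numpy as np

def retrieval_metrics(similarities, ks=(1, 5, 10)):
    """Top-k accuracy and MRR from an [n, n] Garo-to-English similarity matrix.

    Assumes the gold English translation of Garo sentence i is at column i.
    """
    n = similarities.shape[0]
    # Rank English candidates for each Garo query, best first
    order = np.argsort(-similarities, axis=1)
    # 0-based rank of the gold column within each row's ranking
    gold_rank = np.argmax(order == np.arange(n)[:, None], axis=1)
    metrics = {f"top{k}": float(np.mean(gold_rank < k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / (gold_rank + 1)))
    return metrics
```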

## Usage

### Requirements

```bash
pip install torch fasttext-wheel sentence-transformers huggingface-hub scikit-learn
```

(scikit-learn is needed for the cross-lingual retrieval example below.)

### Loading the Model

```python
import torch
import torch.nn as nn
import fasttext
from huggingface_hub import hf_hub_download

# Download model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="Badnyal/GaroEmbed",
    filename="garoembed_best.pt"
)

# Download GaroVec embeddings (required)
garovec_path = hf_hub_download(
    repo_id="MWirelabs/GaroVec",
    filename="garovec_garo.bin"
)

# Load GaroVec
garo_fasttext = fasttext.load_model(garovec_path)

# Define model architecture (see model_architecture.py in repo)
class GaroEmbed(nn.Module):
    def __init__(self, garo_fasttext_model, embedding_dim=300, hidden_dim=512, output_dim=384, dropout=0.3):
        super(GaroEmbed, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        # Copy the pretrained GaroVec vectors into the embedding table
        vocab_size = len(garo_fasttext_model.words)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        weights = []
        for word in garo_fasttext_model.words:
            weights.append(garo_fasttext_model.get_word_vector(word))
        weights_tensor = torch.FloatTensor(weights)
        self.embedding.weight.data.copy_(weights_tensor)
        self.embedding.weight.requires_grad = False  # GaroVec embeddings stay frozen (see Training Details)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=dropout, batch_first=True)
        self.projection = nn.Linear(hidden_dim * 2, output_dim)
        self.word2idx = {word: idx for idx, word in enumerate(garo_fasttext_model.words)}
        self.fasttext_model = garo_fasttext_model

    def tokenize_and_encode(self, sentences):
        batch_indices = []
        batch_lengths = []
        for sentence in sentences:
            tokens = sentence.lower().split()
            indices = []
            for token in tokens:
                if token in self.word2idx:
                    indices.append(self.word2idx[token])
                else:
                    # Out-of-vocabulary tokens fall back to index 0
                    indices.append(0)
            if len(indices) == 0:
                indices = [0]
            batch_indices.append(indices)
            batch_lengths.append(len(indices))
        return batch_indices, batch_lengths

    def forward(self, sentences):
        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
        max_len = max(batch_lengths)
        device = next(self.parameters()).device
        padded = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
        for i, indices in enumerate(batch_indices):
            padded[i, :len(indices)] = torch.LongTensor(indices)
        embedded = self.embedding(padded)
        # Pack so the LSTM skips padding positions
        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_lengths, batch_first=True, enforce_sorted=False)
        lstm_out, (hidden, cell) = self.lstm(packed)
        # Last layer's forward and backward hidden states
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        sentence_embedding = self.projection(combined)
        # L2-normalize so dot products equal cosine similarities
        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
        return sentence_embedding

# Initialize and load weights
model = GaroEmbed(garo_fasttext, output_dim=384)
checkpoint = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Encode Garo sentences
garo_sentences = [
    "Anga namjanika",
    "Rikgiparang kamko suala"
]

with torch.no_grad():
    embeddings = model(garo_sentences)
    print(f"Embeddings shape: {embeddings.shape}")  # [2, 384]
```

### Cross-Lingual Retrieval

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load English encoder (frozen anchor)
english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode Garo and English
garo_texts = ["Anga namjanika", "Garo biapni dokana"]
english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]

with torch.no_grad():
    garo_embeds = model(garo_texts).cpu().numpy()
english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)

# Compute similarities
similarities = cosine_similarity(garo_embeds, english_embeds)
print("Garo-English similarities:")
print(similarities)
```
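
Each row of `similarities` scores one Garo sentence against every English candidate, so the retrieved translation is simply the highest-scoring column:

```python
# Pick the best English match for each Garo sentence
best = similarities.argmax(axis=1)
for garo, idx in zip(garo_texts, best):
    print(f"{garo!r} -> {english_texts[idx]!r}")
```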

## Training Details

- **Architecture**: 2-layer BiLSTM (512 hidden units) + linear projection
- **Loss**: InfoNCE contrastive loss (temperature = 0.07; sketched below)
- **Optimizer**: Adam (lr = 2×10⁻⁴)
- **Batch Size**: 32
- **Epochs**: 20
- **Regularization**: Dropout 0.3, frozen GaroVec embeddings
- **English Anchor**: Frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)
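
The loss treats each Garo sentence's English translation as its positive and every other English sentence in the batch as a negative. A minimal sketch under those assumptions (the training loop itself is not published with this card, and whether a symmetric two-direction loss was used is not stated):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(garo_embeds, english_embeds, temperature=0.07):
    """InfoNCE over a batch of L2-normalized Garo/English pairs.

    Row i of each tensor is a translation pair; all other rows serve
    as in-batch negatives.
    """
    # Scaled cosine similarities: [batch, batch]
    logits = garo_embeds @ english_embeds.T / temperature
    # The matching translation sits on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy check with random normalized vectors
g = F.normalize(torch.randn(32, 384), dim=1)
e = F.normalize(torch.randn(32, 384), dim=1)
print(info_nce_loss(g, e).item())
```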

## Limitations

- Trained on only 3,000 parallel pairs (limited semantic coverage)
- Domain: daily conversation and cultural topics (lacks technical/literary language)
- Orthography: Latin script only
- Morphology: does not explicitly model Garo's agglutinative structure
- Evaluation: limited to retrieval tasks

## Acknowledgments

- Built on [GaroVec](https://huggingface.co/MWirelabs/GaroVec) word embeddings
- English anchor: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Developed at [MWire Labs](https://mwirelabs.com)

## License

MIT License - free for research and commercial use.

## Contact

- **Author**: Badal Nyalang
- **Organization**: MWire Labs
- **Repository**: [https://huggingface.co/Badnyal/GaroEmbed](https://huggingface.co/Badnyal/GaroEmbed)

---

*First neural sentence embedding model for the Garo language • Enabling NLP for low-resource Tibeto-Burman languages*