---
language: bm
tags:
- bambara
- fasttext
- embeddings
- word-vectors
- african-nlp
- low-resource
license: apache-2.0
datasets:
- bambara-corpus
metrics:
- cosine_similarity
pipeline_tag: feature-extraction
---
# Bambara FastText Embeddings
## Model Description
This model provides FastText word embeddings for the Bambara language (Bamanankan), a Mande language spoken primarily in Mali. The embeddings capture semantic relationships between Bambara words and enable various NLP tasks for this low-resource African language.
**Model Type:** FastText Word Embeddings
**Language:** Bambara (bm)
**License:** Apache 2.0
## Model Details
### Model Architecture
- **Algorithm:** FastText with subword information
- **Vector Dimension:** 300
- **Vocabulary Size:** 9,973 unique Bambara words
- **Training Method:** Skip-gram with negative sampling
- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)
### Training Data
The model was trained on Bambara text corpora, building on the work presented in [David Ifeoluwa Adelani's PhD dissertation](https://arxiv.org/abs/2507.00297) on natural language processing for African languages.
### Intended Use
This model is designed for:
- **Semantic similarity tasks** in Bambara
- **Information retrieval** for Bambara documents
- **Cross-lingual research** involving Bambara
- **Cultural preservation** and digital humanities projects
- **Educational applications** for Bambara language learning
- **Foundation for downstream NLP tasks** in Bambara
## Installation
```bash
pip install gensim huggingface_hub scikit-learn numpy
```
## Usage
### Load the Model
```python
import tempfile
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download
model_id = "MALIBA-AI/bambara-fasttext"
# Download model files
model_path = hf_hub_download(repo_id=model_id, filename="bam.bin", cache_dir=tempfile.gettempdir())
vectors_path = hf_hub_download(repo_id=model_id, filename="bam.bin.vectors_ngrams.npy", cache_dir=tempfile.gettempdir())
# Load model
model = KeyedVectors.load(model_path)
print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```
### Get a Word Vector
```python
vector = model["bamako"]
print(f"Shape: {vector.shape}") # (300,)
```
### Find Similar Words
```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f"  {word}: {score:.4f}")
```
### Calculate Similarity Between Two Words
```python
from sklearn.metrics.pairwise import cosine_similarity
vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")
```
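When the file loads as a gensim `KeyedVectors` object (as in the loading example above), the same cosine similarity is also available directly from gensim, without going through scikit-learn:
```python
# Built-in cosine similarity between two vocabulary words.
print(f"Similarity: {model.similarity('muso', 'cɛ'):.4f}")
```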
### Convert Text to Vector (Average of Word Vectors)
```python
import numpy as np
def text_to_vector(text, model):
    # Average the vectors of the in-vocabulary words; fall back to a
    # zero vector when none of the words are known.
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}") # (300,)
```
### Search for Similar Texts
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def search_similar_texts(query, texts, model, top_k=5):
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        if np.any(text_vec):
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]
texts = [
"dumuni ɲuman bɛ here di",
"bamako ye Mali faaba ye",
"denmisɛnw bɛ kalan kɛ",
]
results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
    print(f"  [{score:.4f}] {text}")
```
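For larger collections, scoring documents one at a time gets slow. The sketch below reuses `text_to_vector` and the `texts` list defined above, precomputes all document vectors once, and ranks them with a single batched `cosine_similarity` call; the variable names are only illustrative.
```python
# Stack all document vectors into one matrix of shape (n_texts, 300).
doc_matrix = np.vstack([text_to_vector(t, model) for t in texts])

# Score the whole batch against the query in one call.
query_vec = text_to_vector("Mali jamana", model).reshape(1, -1)
scores = cosine_similarity(query_vec, doc_matrix)[0]

# Rank documents from most to least similar.
for i in np.argsort(scores)[::-1]:
    print(f"  [{scores[i]:.4f}] {texts[i]}")
```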
### Check if a Word Exists in the Vocabulary
```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    print(f"'{word}' is not in the vocabulary")
```
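Because the embeddings include character n-gram (subword) information, a word that is *not* an explicit vocabulary entry can often still receive a vector composed from its subword pieces. The following is a minimal sketch of that behaviour; it assumes the loaded object is gensim's `FastTextKeyedVectors` (so that indexing falls back to n-grams) and uses a purely hypothetical out-of-vocabulary form.
```python
# Hypothetical word used only for illustration; it may or may not be
# an explicit entry in the 9,973-word vocabulary.
oov_word = "dumunikɛla"

if oov_word not in model.key_to_index:
    # Not a vocabulary entry, but a vector can still be composed from
    # its character n-grams (FastTextKeyedVectors behaviour).
    vector = model[oov_word]
    print(f"OOV vector shape: {vector.shape}")  # (300,)
```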
## Limitations
- Vocabulary is limited to 9,973 words (though subword information helps with OOV words)
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data
## References
```bibtex
@misc{bambara-fasttext,
  author       = {MALIBA-AI},
  title        = {Bambara FastText Embeddings},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}

@phdthesis{adelani2025nlp,
  title  = {Natural Language Processing for African Languages},
  author = {Adelani, David Ifeoluwa},
  year   = {2025},
  school = {Saarland University},
  note   = {arXiv:2507.00297}
}
```
## License
This project is licensed under Apache 2.0.
## Contributing
This project is part of the [MALIBA-AI](https://huggingface.co/MALIBA-AI) initiative with the mission **"No Malian Language Left Behind."**
---
**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**
*"No Malian Language Left Behind"*