---
language: bm
tags:
- bambara
- fasttext
- embeddings
- word-vectors
- african-nlp
- low-resource
license: apache-2.0
datasets:
- bambara-corpus
metrics:
- cosine_similarity
pipeline_tag: feature-extraction
---

# Bambara FastText Embeddings

## Model Description

This model provides FastText word embeddings for the Bambara language (Bamanankan), a Mande language spoken primarily in Mali. The embeddings capture semantic relationships between Bambara words and enable various NLP tasks for this low-resource African language.

**Model Type:** FastText Word Embeddings

**Language:** Bambara (bm)

**License:** Apache 2.0

## Model Details

### Model Architecture

- **Algorithm:** FastText with subword information
- **Vector Dimension:** 300
- **Vocabulary Size:** 9,973 unique Bambara words
- **Training Method:** Skip-gram with negative sampling
- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)

### Training Data

The model was trained on Bambara text corpora, building upon the work of [David Ifeoluwa Adelani's PhD dissertation](https://arxiv.org/abs/2507.00297) on natural language processing for African languages.
### Intended Use

This model is designed for:

- **Semantic similarity tasks** in Bambara
- **Information retrieval** for Bambara documents
- **Cross-lingual research** involving Bambara
- **Cultural preservation** and digital humanities projects
- **Educational applications** for Bambara language learning
- **Foundation for downstream NLP tasks** in Bambara
## Installation

```bash
pip install gensim huggingface_hub scikit-learn numpy
```
## Usage

### Load the Model

```python
import tempfile

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_id = "MALIBA-AI/bambara-fasttext"

# Download the model and its n-gram vectors into the same cache directory
# so that gensim can find both files when loading.
model_path = hf_hub_download(repo_id=model_id, filename="bam.bin", cache_dir=tempfile.gettempdir())
vectors_path = hf_hub_download(repo_id=model_id, filename="bam.bin.vectors_ngrams.npy", cache_dir=tempfile.gettempdir())

# Load the embeddings
model = KeyedVectors.load(model_path)

print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```

### Get a Word Vector

```python
vector = model["bamako"]
print(f"Shape: {vector.shape}")  # (300,)
```

### Find Similar Words

```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f"  {word}: {score:.4f}")
```

### Calculate Similarity Between Two Words

```python
from sklearn.metrics.pairwise import cosine_similarity

vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")
```
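
For reference, cosine similarity is just the dot product of the two vectors divided by the product of their norms. A minimal NumPy version on toy vectors (not Bambara embeddings) shows the computation sklearn performs:

```python
import numpy as np

def cosine(a, b):
    # cos(a, b) = a·b / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors, purely for illustration.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(round(cosine(a, b), 4))  # 0.5
```

gensim's `KeyedVectors` also exposes `model.similarity(word1, word2)`, which returns the same quantity directly without going through sklearn.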

### Convert Text to Vector (Average of Word Vectors)

```python
import numpy as np

def text_to_vector(text, model):
    # Average the vectors of all in-vocabulary words in the text.
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}")  # (300,)
```

### Search for Similar Texts

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search_similar_texts(query, texts, model, top_k=5):
    # Rank candidate texts by cosine similarity to the query vector.
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        if np.any(text_vec):  # skip texts with no in-vocabulary words
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]

texts = [
    "dumuni ɲuman bɛ here di",
    "bamako ye Mali faaba ye",
    "denmisɛnw bɛ kalan kɛ",
]

results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
    print(f"  [{score:.4f}] {text}")
```

### Check if a Word Exists in the Vocabulary

```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    # Note: thanks to subword information, FastText can still build a
    # vector for out-of-vocabulary words from their character n-grams.
    print(f"'{word}' is not in the vocabulary")
```

## Limitations

- Vocabulary is limited to 9,973 words (though subword information helps with OOV words)
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data
## References

```bibtex
@misc{bambara-fasttext,
  author       = {MALIBA-AI},
  title        = {Bambara FastText Embeddings},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}

@phdthesis{adelani2025nlp,
  title  = {Natural Language Processing for African Languages},
  author = {Adelani, David Ifeoluwa},
  year   = {2025},
  school = {Saarland University},
  note   = {arXiv:2507.00297}
}
```

## License

This project is licensed under Apache 2.0.
## Contributing

This project is part of the [MALIBA-AI](https://huggingface.co/MALIBA-AI) initiative, whose mission is **"No Malian Language Left Behind."**
---

**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**

*"No Malian Language Left Behind"*