---
language: bm
tags:
- bambara
- fasttext
- embeddings
- word-vectors
- african-nlp
- low-resource
license: apache-2.0
datasets:
- bambara-corpus
metrics:
- cosine_similarity
pipeline_tag: feature-extraction
---
# Bambara FastText Embeddings
## Model Description
This model provides FastText word embeddings for the Bambara language (Bamanankan), a Mande language spoken primarily in Mali. The embeddings capture semantic relationships between Bambara words and enable various NLP tasks for this low-resource African language.
**Model Type:** FastText Word Embeddings
**Language:** Bambara (bm)
**License:** Apache 2.0
## Model Details
### Model Architecture
- **Algorithm:** FastText with subword information
- **Vector Dimension:** 300
- **Vocabulary Size:** 9,973 unique Bambara words
- **Training Method:** Skip-gram with negative sampling
- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)
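To illustrate the subword mechanism listed above, here is a minimal sketch of how FastText-style character n-grams are extracted from a word. The function name `char_ngrams` is hypothetical, and the 3-to-6 n-gram range is FastText's usual default rather than a setting confirmed for this model.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return FastText-style character n-grams for a word.

    The word is wrapped in '<' and '>' boundary markers before slicing.
    min_n and max_n are assumed defaults, not values confirmed for this model.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("muso", min_n=3, max_n=3))
```

The vector for an out-of-vocabulary word is built from the vectors of its n-grams, which is why unseen words can still receive an embedding.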
### Training Data
The model was trained on Bambara text corpora and builds on [David Ifeoluwa Adelani's PhD dissertation](https://arxiv.org/abs/2507.00297) on natural language processing for African languages.
### Intended Use
This model is designed for:
- **Semantic similarity tasks** in Bambara
- **Information retrieval** for Bambara documents
- **Cross-lingual research** involving Bambara
- **Cultural preservation** and digital humanities projects
- **Educational applications** for Bambara language learning
- **Foundation for downstream NLP tasks** in Bambara
## Installation
```bash
pip install gensim huggingface_hub scikit-learn numpy
```
## Usage
### Load the Model
```python
import tempfile
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download
model_id = "MALIBA-AI/bambara-fasttext"
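# Download the model file and its companion n-gram array.
# gensim looks for the .npy file next to bam.bin, so both downloads use the same cache_dir.
model_path = hf_hub_download(repo_id=model_id, filename="bam.bin", cache_dir=tempfile.gettempdir())
vectors_path = hf_hub_download(repo_id=model_id, filename="bam.bin.vectors_ngrams.npy", cache_dir=tempfile.gettempdir())
# Load the model (the .npy file is read from the same directory as bam.bin)
model = KeyedVectors.load(model_path)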
print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```
### Get a Word Vector
```python
vector = model["bamako"]
print(f"Shape: {vector.shape}") # (300,)
```
### Find Similar Words
```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f" {word}: {score:.4f}")
```
### Calculate Similarity Between Two Words
```python
from sklearn.metrics.pairwise import cosine_similarity
vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")
```
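For reference, cosine similarity can also be computed directly with NumPy, without scikit-learn. This is a minimal sketch: the helper name `cosine` and the toy vectors are illustrative and do not come from the model.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors; 0.0 if either has zero norm."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)


# Toy vectors standing in for model["muso"] and model["cɛ"]
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(f"Similarity: {cosine(a, b):.4f}")
```

With the real model, `cosine(model["muso"], model["cɛ"])` should agree with the scikit-learn result up to floating-point error.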
### Convert Text to Vector (Average of Word Vectors)
```python
import numpy as np
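def text_to_vector(text, model):
    # Lowercase, split on whitespace, and keep only in-vocabulary words
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        # No known words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)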
text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}") # (300,)
```
### Search for Similar Texts
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def search_similar_texts(query, texts, model, top_k=5):
    # Relies on text_to_vector from the previous example
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        # Skip texts with no in-vocabulary words (all-zero vector)
        if np.any(text_vec):
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]
texts = [
"dumuni ɲuman bɛ here di",
"bamako ye Mali faaba ye",
"denmisɛnw bɛ kalan kɛ",
]
results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
print(f" [{score:.4f}] {text}")
```
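For larger collections, the per-text loop above can be replaced by a single matrix operation. The sketch below assumes the text vectors have already been computed (for example with `text_to_vector`) and stacked into a 2-D array. `rank_by_cosine` is a hypothetical helper, and the toy vectors stand in for real embeddings.

```python
import numpy as np


def rank_by_cosine(query_vec, text_matrix, top_k=5):
    """Rank rows of text_matrix by cosine similarity to query_vec.

    Returns a list of (row_index, score) pairs, best match first.
    Rows with zero norm (texts with no known words) get a score of 0.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    text_matrix = np.asarray(text_matrix, dtype=float)
    norms = np.linalg.norm(text_matrix, axis=1) * np.linalg.norm(query_vec)
    dots = text_matrix @ query_vec
    # Avoid division by zero for all-zero rows or an all-zero query
    scores = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-dimensional vectors standing in for averaged word embeddings
matrix = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
print(rank_by_cosine([1.0, 0.1, 0.0], matrix, top_k=2))
```

Unlike `search_similar_texts`, this version keeps zero-vector texts in the ranking with a score of 0 instead of dropping them.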
### Check if a Word Exists in the Vocabulary
```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    print(f"'{word}' is not in the vocabulary")
```
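The same membership test extends to measuring how much of a text the vocabulary covers. This is a minimal sketch: `vocab_coverage` is a hypothetical helper, and the small set below stands in for `model.key_to_index` rather than reflecting the model's actual vocabulary.

```python
def vocab_coverage(text, vocab):
    """Return (coverage, missing): the fraction of whitespace tokens found
    in vocab, and the list of tokens that were not found."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0, []
    missing = [t for t in tokens if t not in vocab]
    coverage = (len(tokens) - len(missing)) / len(tokens)
    return coverage, missing


# Toy vocabulary standing in for model.key_to_index
toy_vocab = {"mali", "ye", "jamana"}
coverage, missing = vocab_coverage("Mali ye jamana ɲuman ye", toy_vocab)
print(f"Coverage: {coverage:.2f}, missing: {missing}")
```

With the real model, pass `model.key_to_index` as `vocab` to see which tokens would be skipped by `text_to_vector`.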
## Limitations
- Vocabulary is limited to 9,973 words (though subword information helps with OOV words)
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data
## References
```bibtex
@misc{bambara-fasttext,
author = {MALIBA-AI},
title = {Bambara FastText Embeddings},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}
@phdthesis{adelani2025nlp,
title={Natural Language Processing for African Languages},
author={Adelani, David Ifeoluwa},
year={2025},
school={Saarland University},
note={arXiv:2507.00297}
}
```
## License
This project is licensed under Apache 2.0.
## Contributing
This project is part of the [MALIBA-AI](https://huggingface.co/MALIBA-AI) initiative, whose mission is **"No Malian Language Left Behind."**
---
**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**
*"No Malian Language Left Behind"*