Update README.md

This model provides FastText word embeddings for the Bambara language (Bamanankan).
**Language:** Bambara (bm)
**License:** Apache 2.0

## Model Details

### Model Architecture

- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)

### Training Data

The model was trained on Bambara text corpora, building upon the work of [David Ifeoluwa Adelani's PhD dissertation](https://arxiv.org/abs/2507.00297) on natural language processing for African languages.

### Intended Use

This model is designed for:

- **Semantic similarity tasks** in Bambara
- **Information retrieval** for Bambara documents
- **Cross-lingual research** involving Bambara
- **Educational applications** for Bambara language learning
- **Foundation for downstream NLP tasks** in Bambara

## Installation

```bash
pip install gensim huggingface_hub scikit-learn numpy
```

## Usage

### Load the Model

```python
import tempfile
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_id = "MALIBA-AI/bambara-fasttext"

# Download model files
model_path = hf_hub_download(repo_id=model_id, filename="bam.bin", cache_dir=tempfile.gettempdir())
vectors_path = hf_hub_download(repo_id=model_id, filename="bam.bin.vectors_ngrams.npy", cache_dir=tempfile.gettempdir())

# Load model
model = KeyedVectors.load(model_path)

print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```

### Get a Word Vector

```python
vector = model["bamako"]
print(f"Shape: {vector.shape}")  # (300,)
```

### Find Similar Words

```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f" {word}: {score:.4f}")
```

### Calculate Similarity Between Two Words

```python
from sklearn.metrics.pairwise import cosine_similarity

vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")
```

### Convert Text to Vector (Average of Word Vectors)

```python
import numpy as np

def text_to_vector(text, model):
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}")  # (300,)
```

### Search for Similar Texts

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_similar_texts(query, texts, model, top_k=5):
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        if np.any(text_vec):
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]

texts = [
    "dumuni ɲuman bɛ here di",
    "bamako ye Mali faaba ye",
    "denmisɛnw bɛ kalan kɛ",
]

results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
    print(f" [{score:.4f}] {text}")
```

### Check if a Word Exists in the Vocabulary

```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    print(f"'{word}' is not in the vocabulary")
```

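### Handle Out-of-Vocabulary Words

The character n-grams mentioned under Model Architecture let the model compose a vector even for words outside its vocabulary. This is a minimal sketch, assuming the checkpoint loads as gensim's `FastTextKeyedVectors` (which the bundled `bam.bin.vectors_ngrams.npy` file suggests); `"bamakoden"` is only an illustrative token and may or may not be in the vocabulary.

```python
word = "bamakoden"  # illustrative token, not necessarily in the vocabulary

# Check vocabulary membership the same way as above
print(f"In vocabulary: {word in model.key_to_index}")

# With FastText subword information, a vector is composed from character
# n-grams even when the word itself was never seen during training.
vector = model[word]
print(f"Shape: {vector.shape}")  # (300,)

# The composed vector behaves like any other, e.g. for word similarity
print(f"Similarity to 'bamako': {model.similarity('bamako', word):.4f}")
```

If the checkpoint is instead a plain `KeyedVectors` object without n-gram buckets, the lookup above would raise a `KeyError` for unknown words.
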
## Limitations

- Vocabulary is limited to 9,973 words (though subword information helps with out-of-vocabulary words)
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data

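Because the coverage and domain caveats above depend on your own data, a quick, hypothetical helper (reusing `model` and the example sentences from Usage; `vocab_coverage` is not part of this repository) can estimate how much of a corpus falls inside the 9,973-word vocabulary:

```python
def vocab_coverage(texts, model):
    # Fraction of whitespace-split tokens that are in the model's vocabulary
    tokens = [w for text in texts for w in text.lower().split()]
    if not tokens:
        return 0.0
    known = sum(1 for w in tokens if w in model.key_to_index)
    return known / len(tokens)

sample = [
    "dumuni ɲuman bɛ here di",
    "bamako ye Mali faaba ye",
]
print(f"Vocabulary coverage: {vocab_coverage(sample, model):.1%}")
```

Low coverage on your documents means the averaging helper in `text_to_vector` will skip many tokens, so similarity scores on such text should be read with extra care.
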
## References

```bibtex
@misc{bambara-fasttext,
  author       = {MALIBA-AI},
  title        = {Bambara FastText Embeddings},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}

@phdthesis{adelani2025nlp,
  title  = {Natural Language Processing for African Languages},
  author = {Adelani, David Ifeoluwa},
  year   = {2025},
  school = {Saarland University},
  note   = {arXiv:2507.00297}
}
```

## License

This project is licensed under Apache 2.0.

## Contributing

This project is part of the [MALIBA-AI](https://huggingface.co/MALIBA-AI) initiative, whose mission is **"No Malian Language Left Behind."**

---

**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**

*"No Malian Language Left Behind"*