---
language: bm
tags:
  - bambara
  - fasttext
  - embeddings
  - word-vectors
  - african-nlp
  - low-resource
license: apache-2.0
datasets:
  - bambara-corpus
metrics:
  - cosine_similarity
pipeline_tag: feature-extraction
---

# Bambara FastText Embeddings

## Model Description

This model provides FastText word embeddings for the Bambara language (Bamanankan), a Mande language spoken primarily in Mali. The embeddings capture semantic relationships between Bambara words and enable various NLP tasks for this low-resource African language.

- **Model Type:** FastText word embeddings
- **Language:** Bambara (`bm`)
- **License:** Apache 2.0

## Model Details

### Model Architecture

- **Algorithm:** FastText with subword information
- **Vector Dimension:** 300
- **Vocabulary Size:** 9,973 unique Bambara words
- **Training Method:** Skip-gram with negative sampling
- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)
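The subword mechanism works by decomposing each word into character n-grams before lookup. As a rough illustration of that decomposition (assuming the common FastText defaults of `min_n=3`, `max_n=6` — the exact range used for this model is not documented here):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams with FastText-style '<' and '>' boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]

print(char_ngrams("muso", min_n=3, max_n=4))
# ['<mu', 'mus', 'uso', 'so>', '<mus', 'muso', 'uso>']
```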

### Training Data

The model was trained on Bambara text corpora, building on David Ifeoluwa Adelani's PhD dissertation on natural language processing for African languages.

## Intended Use

This model is designed for:

- Semantic similarity tasks in Bambara
- Information retrieval for Bambara documents
- Cross-lingual research involving Bambara
- Cultural preservation and digital humanities projects
- Educational applications for Bambara language learning
- Foundation for downstream NLP tasks in Bambara

## Installation

```bash
pip install gensim huggingface_hub scikit-learn numpy
```

## Usage

### Load the Model

```python
import tempfile

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_id = "MALIBA-AI/bambara-fasttext"

# Download model files. The n-gram vectors file is never referenced directly,
# but KeyedVectors.load() expects it next to bam.bin, so it must be fetched too.
model_path = hf_hub_download(repo_id=model_id, filename="bam.bin", cache_dir=tempfile.gettempdir())
vectors_path = hf_hub_download(repo_id=model_id, filename="bam.bin.vectors_ngrams.npy", cache_dir=tempfile.gettempdir())

# Load model
model = KeyedVectors.load(model_path)

print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```

### Get a Word Vector

```python
vector = model["bamako"]
print(f"Shape: {vector.shape}")  # (300,)
```

### Find Similar Words

```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f"  {word}: {score:.4f}")
```

### Calculate Similarity Between Two Words

```python
from sklearn.metrics.pairwise import cosine_similarity

vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")

# Equivalent gensim shortcut:
# similarity = model.similarity("muso", "cɛ")
```

### Convert Text to Vector (Average of Word Vectors)

```python
import numpy as np

def text_to_vector(text, model):
    """Average the vectors of in-vocabulary words; zeros if none are found."""
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}")  # (300,)
```

### Search for Similar Texts

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search_similar_texts(query, texts, model, top_k=5):
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        if np.any(text_vec):  # skip texts with no in-vocabulary words
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]

texts = [
    "dumuni ɲuman bɛ here di",
    "bamako ye Mali faaba ye",
    "denmisɛnw bɛ kalan kɛ",
]

results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
    print(f"  [{score:.4f}] {text}")
```

### Check if a Word Exists in the Vocabulary

```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    print(f"'{word}' is not in the vocabulary")
```

## Limitations

- Vocabulary is limited to 9,973 words (though subword information helps with OOV words)
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data
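The OOV point is worth unpacking: FastText composes a vector for an unseen word by averaging the trained vectors of its character n-grams, so words sharing a stem end up with correlated vectors even when one is out of vocabulary. A toy demonstration of that effect, using deterministic pseudo-random stand-ins for the trained n-gram vectors and ignoring FastText's hashing of n-grams into a fixed bucket table (`kalan` and `kalanso` share a stem; `bamako` does not):

```python
import zlib

import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]

def composed_vector(word, dim=50):
    """Average per-n-gram vectors, as FastText does for OOV words.
    Each n-gram vector here is a deterministic pseudo-random stand-in."""
    vecs = [np.random.default_rng(zlib.crc32(g.encode())).standard_normal(dim)
            for g in char_ngrams(word)]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Shared n-grams ('<ka', 'kal', 'ala', ...) pull the composed vectors together.
print(cosine(composed_vector("kalan"), composed_vector("kalanso")))  # clearly positive
print(cosine(composed_vector("kalan"), composed_vector("bamako")))   # near zero
```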

## References

```bibtex
@misc{bambara-fasttext,
  author       = {MALIBA-AI},
  title        = {Bambara FastText Embeddings},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}

@phdthesis{adelani2025nlp,
  title  = {Natural Language Processing for African Languages},
  author = {Adelani, David Ifeoluwa},
  year   = {2025},
  school = {Saarland University},
  note   = {arXiv:2507.00297}
}
```

## License

This project is licensed under Apache 2.0.

## Contributing

This project is part of the MALIBA-AI initiative, whose mission is "No Malian Language Left Behind."


---

**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**

*"No Malian Language Left Behind"*