---
language: bm
tags:
- bambara
- fasttext
- embeddings
- word-vectors
- african-nlp
- low-resource
license: apache-2.0
datasets:
- bambara-corpus
metrics:
- cosine_similarity
pipeline_tag: feature-extraction
---

# Bambara FastText Embeddings

## Model Description

This model provides FastText word embeddings for the Bambara language (Bamanankan), a Mande language spoken primarily in Mali. The embeddings capture semantic relationships between Bambara words and enable a range of NLP tasks for this low-resource African language.

**Model Type:** FastText Word Embeddings
**Language:** Bambara (bm)
**License:** Apache 2.0

## Model Details

### Model Architecture

- **Algorithm:** FastText with subword information
- **Vector Dimension:** 300
- **Vocabulary Size:** 9,973 unique Bambara words
- **Training Method:** Skip-gram with negative sampling
- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)

### Training Data

The model was trained on Bambara text corpora, building upon the work of [David Ifeoluwa Adelani's PhD dissertation](https://arxiv.org/abs/2507.00297) on natural language processing for African languages.
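To illustrate what "character n-grams" means here, below is a minimal, self-contained sketch of FastText-style n-gram extraction. It is an approximation of the idea only — the real FastText implementation additionally hashes these n-grams into a fixed number of buckets and sums their learned vectors with the word vector — and the `char_ngrams` helper is hypothetical, not part of gensim's API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract FastText-style character n-grams from a word.

    FastText wraps each word in the boundary markers '<' and '>'
    and collects every n-gram of length n_min through n_max.
    An out-of-vocabulary word still shares n-grams with known
    words, which is what makes OOV vectors possible.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# For example, the Bambara word "muso" with 3- and 4-grams:
print(char_ngrams("muso", 3, 4))
# → ['<mu', 'mus', 'uso', 'so>', '<mus', 'muso', 'uso>']
```

Because a misspelled or unseen word still shares many of these n-grams with in-vocabulary words, its composed vector lands near its neighbors in the embedding space.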
### Intended Use

This model is designed for:

- **Semantic similarity tasks** in Bambara
- **Information retrieval** for Bambara documents
- **Cross-lingual research** involving Bambara
- **Cultural preservation** and digital humanities projects
- **Educational applications** for Bambara language learning
- **Foundation for downstream NLP tasks** in Bambara

## Installation

```bash
pip install gensim huggingface_hub scikit-learn numpy
```

## Usage

### Load the Model

```python
import tempfile

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_id = "MALIBA-AI/bambara-fasttext"

# Download the model files (the .npy sidecar stores the n-gram vectors)
model_path = hf_hub_download(
    repo_id=model_id,
    filename="bam.bin",
    cache_dir=tempfile.gettempdir(),
)
vectors_path = hf_hub_download(
    repo_id=model_id,
    filename="bam.bin.vectors_ngrams.npy",
    cache_dir=tempfile.gettempdir(),
)

# Load the model
model = KeyedVectors.load(model_path)
print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```

### Get a Word Vector

```python
vector = model["bamako"]
print(f"Shape: {vector.shape}")  # (300,)
```

### Find Similar Words

```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f"  {word}: {score:.4f}")
```

### Calculate Similarity Between Two Words

```python
from sklearn.metrics.pairwise import cosine_similarity

vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")
```

### Convert Text to Vector (Average of Word Vectors)

```python
import numpy as np

def text_to_vector(text, model):
    """Average the vectors of the in-vocabulary words in a text."""
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}")  # (300,)
```

### Search for Similar Texts

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_similar_texts(query, texts, model, top_k=5):
    """Rank texts by cosine similarity to the query vector."""
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        if np.any(text_vec):
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]

texts = [
    "dumuni ɲuman bɛ here di",
    "bamako ye Mali faaba ye",
    "denmisɛnw bɛ kalan kɛ",
]

results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
    print(f"  [{score:.4f}] {text}")
```

### Check if a Word Exists in the Vocabulary

```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    print(f"'{word}' is not in the vocabulary")
```

## Limitations

- Vocabulary is limited to 9,973 words (though subword information helps with out-of-vocabulary words)
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data

## References

```bibtex
@misc{bambara-fasttext,
  author = {MALIBA-AI},
  title = {Bambara FastText Embeddings},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}

@phdthesis{adelani2025nlp,
  title = {Natural Language Processing for African Languages},
  author = {Adelani, David Ifeoluwa},
  year = {2025},
  school = {Saarland University},
  note = {arXiv:2507.00297}
}
```

## License

This project is licensed under Apache 2.0.

## Contributing

This project is part of the [MALIBA-AI](https://huggingface.co/MALIBA-AI) initiative, whose mission is **"No Malian Language Left Behind."**

---

**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**

*"No Malian Language Left Behind"*