sudoping01 committed · verified
Commit 8ab8ba5 · Parent(s): 0c12e53

Update README.md

Files changed (1): README.md (+146 -3)
**Language:** Bambara (bm)
**License:** Apache 2.0

## Model Details

### Model Architecture

- **Subword Information:** Character n-grams (enables handling of out-of-vocabulary words)
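As a sketch of how the subword information works (this mirrors FastText's n-gram enumeration, not the trained model's exact hashing scheme; `min_n=3` and `max_n=6` are assumed, commonly used defaults):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Enumerate character n-grams FastText-style, including the
    '<' and '>' boundary markers added around each token."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("muso"))  # ['<mu', 'mus', 'uso', 'so>', ...]
```

An unseen word is represented by combining the vectors of its n-grams, which is how FastText produces vectors for words outside the trained vocabulary.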

### Training Data

The model was trained on Bambara text corpora, building upon the work of [David Ifeoluwa Adelani's PhD dissertation](https://arxiv.org/abs/2507.00297) on natural language processing for African languages.

### Intended Use

This model is designed for:

- **Semantic similarity tasks** in Bambara
- **Information retrieval** for Bambara documents
- **Cross-lingual research** involving Bambara
- **Educational applications** for Bambara language learning
- **Foundation for downstream NLP tasks** in Bambara

## Installation

```bash
pip install gensim huggingface_hub scikit-learn numpy
```

## Usage

### Load the Model

```python
import tempfile

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_id = "MALIBA-AI/bambara-fasttext"

# Download the model file and its n-gram vectors into the same cache
model_path = hf_hub_download(repo_id=model_id, filename="bam.bin", cache_dir=tempfile.gettempdir())
vectors_path = hf_hub_download(repo_id=model_id, filename="bam.bin.vectors_ngrams.npy", cache_dir=tempfile.gettempdir())

# Load the model
model = KeyedVectors.load(model_path)

print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")
```

### Get a Word Vector

```python
vector = model["bamako"]
print(f"Shape: {vector.shape}")  # (300,)
```

### Find Similar Words

```python
similar_words = model.most_similar("dumuni", topn=10)
for word, score in similar_words:
    print(f"  {word}: {score:.4f}")
```

### Calculate Similarity Between Two Words

```python
from sklearn.metrics.pairwise import cosine_similarity

vec1 = model["muso"]
vec2 = model["cɛ"]
similarity = cosine_similarity([vec1], [vec2])[0][0]
print(f"Similarity: {similarity:.4f}")
```
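The scikit-learn call above is equivalent to a short NumPy computation, if you want to avoid the extra dependency; this self-contained sketch uses two made-up vectors rather than model lookups:

```python
import numpy as np

def cosine(u, v):
    # dot product divided by the product of the L2 norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])  # parallel to u, so similarity is 1.0
print(f"{cosine(u, v):.4f}")
```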

### Convert Text to Vector (Average of Word Vectors)

```python
import numpy as np

def text_to_vector(text, model):
    """Average the vectors of the in-vocabulary words in `text`."""
    words = text.lower().split()
    vectors = [model[w] for w in words if w in model.key_to_index]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

text_vec = text_to_vector("Mali ye jamana ɲuman ye", model)
print(f"Shape: {text_vec.shape}")  # (300,)
```

### Search for Similar Texts

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search_similar_texts(query, texts, model, top_k=5):
    """Rank `texts` by cosine similarity to `query`, skipping all-zero vectors."""
    query_vec = text_to_vector(query, model)
    results = []
    for i, text in enumerate(texts):
        text_vec = text_to_vector(text, model)
        if np.any(text_vec):
            sim = cosine_similarity([query_vec], [text_vec])[0][0]
            results.append((sim, text, i))
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]

texts = [
    "dumuni ɲuman bɛ here di",
    "bamako ye Mali faaba ye",
    "denmisɛnw bɛ kalan kɛ",
]

results = search_similar_texts("Mali jamana", texts, model)
for score, text, idx in results:
    print(f"  [{score:.4f}] {text}")
```

### Check if a Word Exists in the Vocabulary

```python
word = "bamako"
if word in model.key_to_index:
    print(f"'{word}' is in the vocabulary")
else:
    print(f"'{word}' is not in the vocabulary")
```

## Limitations

- Vocabulary is limited to 9,973 words (though subword information helps with OOV words)
- Performance depends on the quality and coverage of the training corpus
- May not capture domain-specific terminology well
- Embeddings reflect biases present in the training data
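Related to the vocabulary limit above: if the loaded object behaves like a plain `KeyedVectors` and raises `KeyError` for out-of-vocabulary words, a defensive lookup helper avoids crashes. `safe_vector` is a hypothetical helper (not part of the gensim API), shown here with a toy dict standing in for the real model:

```python
import numpy as np

VECTOR_SIZE = 300  # matches the model's vector dimension

def safe_vector(model, word):
    """Return the word's vector, or an all-zero vector if the word is OOV."""
    try:
        return model[word]
    except KeyError:
        return np.zeros(VECTOR_SIZE)

# Toy stand-in for the loaded model
toy_model = {"bamako": np.ones(VECTOR_SIZE)}

print(safe_vector(toy_model, "bamako").sum())       # 300.0
print(safe_vector(toy_model, "zzz-unknown").sum())  # 0.0
```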

## References

```bibtex
@misc{bambara-fasttext,
  author       = {MALIBA-AI},
  title        = {Bambara FastText Embeddings},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MALIBA-AI/bambara-fasttext}}
}

@phdthesis{adelani2025nlp,
  title  = {Natural Language Processing for African Languages},
  author = {Adelani, David Ifeoluwa},
  year   = {2025},
  school = {Saarland University},
  note   = {arXiv:2507.00297}
}
```

## License

This project is licensed under Apache 2.0.

## Contributing

This project is part of the [MALIBA-AI](https://huggingface.co/MALIBA-AI) initiative, whose mission is **"No Malian Language Left Behind."**

---

**MALIBA-AI: Empowering Mali's Future Through Community-Driven AI Innovation**

*"No Malian Language Left Behind"*