Update README.md
README.md (changed)
tags:
- cpims
- african-nlp
- afriberta
base_model: castorini/afriberta_large
license: apache-2.0
metrics:
- perplexity
library_name: transformers
---

# AfriBERT Kenya – Domain-Adapted Language Model
…

---

## Use Cases & Practical Domains

This model is designed for any NLP task involving Kenyan-language text. It provides a stronger starting point than a generic multilingual model wherever the input contains Swahili, Sheng, code-switching, or Kenyan institutional vocabulary.

### 1. Child Protection & Social Work (CPIMS)

The primary motivation for this model. Kenya's Child Protection Information Management System (CPIMS) generates a high volume of support requests, case notes, and field reports written by social workers, case managers, and NGO staff, often in a mix of English, Swahili, and Sheng.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Help-desk intent classification** | Route incoming support messages to the correct team or knowledge-base article | *"Siwezi kuingia system, password yangu imekwisha"* → `PasswordReset` |
| **Urgency triage** | Flag messages that need immediate human escalation (child at risk, abuse, missing child) | *"Mtoto amekimbia safe house usiku huu"* → `urgent` |
| **Case note sentiment** | Detect frustration or distress in field-worker messages to trigger supervisor review | *"Nimejaribu mara nyingi kupata msaada lakini hakuna anayejibu"* → `negative` |
| **Entity extraction (NER)** | Extract names, locations, case IDs, and child ages from free-text case notes | *"Amina, miaka 9, Kibera, Case ID CP-2024-0471"* |
| **Automated case routing** | Predict which department or OVC program a case should be assigned to | Based on case note text |
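
The urgency-triage row only becomes useful once classifier scores are turned into an action. A minimal sketch of that post-processing step; the fine-tuned classifier is replaced by a keyword stub so the snippet is self-contained, and the label set, keywords, and threshold are all hypothetical:

```python
# Sketch: turning urgency-classifier output into an escalation decision.
# In production, `classify` would be a fine-tuned text-classification head
# on top of this model; the stub below only imitates its output format.

URGENT_LABELS = {"urgent"}       # hypothetical label set: {urgent, routine}
ESCALATION_THRESHOLD = 0.35      # hypothetical confidence floor for escalation

def stub_classifier(text: str) -> dict:
    """Stand-in for a fine-tuned pipeline("text-classification", ...)."""
    keywords = ("amekimbia", "amepotea", "hatari")
    hit = any(k in text.lower() for k in keywords)
    return {"label": "urgent" if hit else "routine", "score": 0.9 if hit else 0.8}

def triage(text: str, classify=stub_classifier) -> str:
    """Map a (label, score) prediction onto a routing decision."""
    pred = classify(text)
    if pred["label"] in URGENT_LABELS and pred["score"] >= ESCALATION_THRESHOLD:
        return "escalate_to_human"
    return "standard_queue"

print(triage("Mtoto amekimbia safe house usiku huu"))   # escalate_to_human
print(triage("Nataka kubadilisha password yangu"))      # standard_queue
```

The classifier is injected as a parameter so the same routing logic can be unit-tested without loading a model.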

---

### 2. Financial Services & M-PESA

M-PESA is Kenya's dominant mobile-money platform, used by over 30 million Kenyans. Customer support queries, fraud reports, and transaction disputes are frequently written in Swahili or code-switched language that generic models mishandle.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Transaction dispute classification** | Categorise dispute type: wrong number, reversal, Fuliza, till payment, paybill | *"Nilituma pesa nambari mbaya, naomba reverse"* |
| **Fraud signal detection** | Detect social-engineering scripts, phishing attempts, SIM-swap language | *"Uko na nambari ya siri ya M-PESA? Niambie utatumia"* |
| **Customer sentiment analysis** | Measure customer satisfaction from M-PESA helpline transcripts | Post-interaction classification |
| **FAQ intent matching** | Match a customer query to the nearest self-service FAQ answer | Semantic similarity over a FAQ corpus |
| **Agent response quality scoring** | Score whether a customer service agent's response was appropriate | Given query + response pairs |
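
The FAQ intent-matching row amounts to sentence embeddings plus cosine similarity. A minimal sketch of that matching logic, assuming mean-pooled encoder outputs; the 4-dimensional vectors and FAQ ids below are toy stand-ins for real embeddings from this model:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def best_faq(query_vec, faq_vecs, faq_ids):
    """Return the FAQ id whose embedding is most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    f = faq_vecs / np.linalg.norm(faq_vecs, axis=1, keepdims=True)
    return faq_ids[int(np.argmax(f @ q))]

# Toy stand-ins: 3 FAQ entries embedded in a 4-d space
faq_ids = ["reversal", "fuliza_limit", "pin_reset"]
faq_vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

# Query: 2 real tokens + 1 padding token, pooled into one sentence vector
token_embs = np.array([[0.8, 0.2, 0.0, 0.0],
                       [1.0, 0.0, 0.0, 0.0],
                       [9.9, 9.9, 9.9, 9.9]])   # padding row, masked out
query = mean_pool(token_embs, np.array([1, 1, 0]))
print(best_faq(query, faq_vecs, faq_ids))        # reversal
```

In a real deployment the FAQ matrix would be precomputed once from the FAQ corpus and only the query would be embedded at request time.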

---

### 3. Healthcare & Community Health Workers (CHWs)

Community Health Workers in Kenya file visit reports and referral notes, often verbally transcribed or typed on low-end phones in mixed Swahili/English.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Symptom extraction** | Extract reported symptoms from CHW visit notes | *"Mtoto ana homa kali na kukohoa sana tangu jana"* |
| **Referral urgency classification** | Triage referral notes: emergency, routine, follow-up | *"Mama mjamzito ana maumivu makali, nahitaji ambulance sasa"* → `emergency` |
| **Facility routing** | Predict whether a patient should go to a dispensary, health centre, or county hospital | Based on symptom description |
| **Health campaign text classification** | Classify community feedback on health campaigns (vaccination, family planning) | SMS/WhatsApp response categorisation |
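
Symptom extraction via token classification ends with BIO tags that still need to be grouped into entity spans. A model-independent sketch of that decoding step; the tokens and tags below are a hypothetical output for the example note above:

```python
# Sketch: decoding BIO tags from a token-classification head into spans.
# A fine-tuned AutoModelForTokenClassification on this model would emit
# tags like these; the decoding itself needs no model.

def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Mtoto", "ana", "homa", "kali", "na", "kukohoa", "sana"]
tags   = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM", "O"]
print(bio_to_spans(tokens, tags))  # [('SYMPTOM', 'homa kali'), ('SYMPTOM', 'kukohoa')]
```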

---

### 4. Education & EdTech

Kenya's education sector uses a blend of English instruction and Swahili explanation, especially in the lower grades. Many EdTech platforms serving rural Kenya receive student questions in Sheng or code-switched text.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Student question topic classification** | Route a question to the right subject tutor or resource | *"Sijui kusolve equation hii, pia sina calculator"* |
| **Learner frustration detection** | Flag messages indicating confusion or disengagement | *"Sielewi hata kidogo, imefail mara tatu"* |
| **Automatic feedback categorisation** | Classify teacher or parent feedback on school platforms | SMS / app reviews |
| **Readability scoring** | Score educational content for appropriateness at different grade levels | Given a paragraph of Swahili text |

---

### 5. Government & Civic Services

Kenya's e-citizen platforms, county service desks, and public feedback systems receive queries and complaints in everyday Kenyan language.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Service request classification** | Route citizen petitions and complaints to the correct county department | *"Barabara ya kwetu ina mashimo makubwa sana, lini mtarekebisha?"* |
| **Complaint sentiment & severity** | Detect strongly negative or potentially viral citizen complaints | Social media monitoring |
| **Language identification** | Detect whether a message is Swahili, Sheng, English, or code-switched | Pre-routing in multi-language systems |
| **Policy document Q&A** | Answer questions grounded in Swahili government policy documents | Retrieval-augmented generation (RAG) with this encoder |

---

### 6. Media, Social Listening & Misinformation

Twitter/X, Facebook, and WhatsApp in Kenya carry a large volume of Sheng and code-switched content that standard multilingual models struggle to classify.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Hate speech / harmful content detection** | Detect Sheng-coded hate speech or incitement that generic models miss | Election-period social media moderation |
| **Rumour / misinformation flagging** | Classify claims as verified, unverified, or disputed | WhatsApp forward monitoring |
| **Topic classification** | Assign news articles or social posts to categories (politics, economy, sports, health) | Media monitoring dashboards |
| **Sentiment analysis** | Measure public sentiment on policy announcements, brands, or events | Code-switched Twitter/X data |

---

## Fine-tuning Guide

This model can be fine-tuned with as few as **200–500 labelled examples** per class for most classification tasks, because domain-adaptive pretraining (DAPT) has already adapted the internal representations to the target domain.

### Recommended fine-tuning tasks by architecture

| Architecture | Suitable for | HuggingFace class |
|---|---|---|
| Sequence classification | Intent, sentiment, urgency, topic, routing | `AutoModelForSequenceClassification` |
| Token classification | NER (names, locations, case IDs, symptoms) | `AutoModelForTokenClassification` |
| Multi-task (shared encoder + multiple heads) | Intent + urgency simultaneously | Custom (see jenga_ai SDK) |
| Question answering | Policy/FAQ grounding | `AutoModelForQuestionAnswering` |
| Sentence similarity | Semantic search, FAQ matching | Add a pooling head + contrastive loss |
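
The multi-task row (shared encoder + multiple heads) reduces to two linear heads reading one pooled vector. A forward-pass sketch with random stand-in weights; the 768-dimensional hidden size is an assumption (check the model config), and the jenga_ai SDK's multi-task config, shown later, trains roughly this structure end-to-end:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768                       # assumed hidden size; check the model config

def softmax(z):
    """Numerically stable softmax over a 1-d logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One shared sentence representation ([CLS] or mean-pooled encoder output)
pooled = rng.standard_normal(HIDDEN)

# Two task heads over the same vector: 63 intents, 3 urgency levels
intent_head = rng.standard_normal((63, HIDDEN)) * 0.02
urgency_head = rng.standard_normal((3, HIDDEN)) * 0.02

intent_probs = softmax(intent_head @ pooled)
urgency_probs = softmax(urgency_head @ pooled)

print(intent_probs.shape, urgency_probs.shape)  # (63,) (3,)
```

During training, the two cross-entropy losses are summed (optionally weighted) and backpropagated through the shared encoder.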

### Minimum data guidelines

| Task complexity | Approx. labelled examples needed |
|---|---|
| Binary classification (2 classes) | 100–300 per class |
| Multi-class (5–15 classes) | 150–400 per class |
| Multi-class (15–63 classes) | 200–500 per class |
| NER (token-level) | 500–1,000 sentences with full annotation |
| Multi-task (2 heads) | Same as above per task head |

*These estimates are based on domain-adapted models. A generic multilingual base model would need 3–5× more data to reach equivalent performance on Kenyan text.*

### Fine-tuning with HuggingFace Trainer

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

model_name = "Rogendo/afribert-kenya-adapted"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # e.g. urgency: low / medium / high
)

training_args = TrainingArguments(
    output_dir="my-kenya-classifier",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # standard fine-tuning LR
    warmup_ratio=0.1,
    eval_strategy="epoch",  # "evaluation_strategy" in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,  # use bf16 on A100/A40/H100
)

# train_dataset / eval_dataset: tokenized datasets with a "labels" column
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,  # "tokenizer=" in transformers < 4.46
)
trainer.train()
```
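
By default the Trainer reports only loss at evaluation time; a `compute_metrics` callback adds task metrics. A minimal sketch in plain numpy (in practice the `evaluate` library is more common), passed to the Trainer as `compute_metrics=compute_metrics`:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy and macro-F1 from (logits, labels), as the Trainer provides them."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = float((preds == labels).mean())
    f1s = []
    for c in np.unique(labels):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1s))}

# Toy check: 4 examples, 3 classes
logits = np.array([[2.0, 0.1, 0.0],
                   [0.0, 3.0, 0.2],
                   [0.1, 0.0, 1.5],
                   [1.2, 1.0, 0.0]])
labels = np.array([0, 1, 2, 1])
print(compute_metrics((logits, labels)))
```

Macro-F1 is the better headline number here, since urgency and intent labels are typically imbalanced.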

### Fine-tuning with jenga_ai SDK (multi-task)

```yaml
# cpims_config.yaml
model:
  base_model: Rogendo/afribert-kenya-adapted
  max_seq_len: 128

tasks:
  - name: intent
    task_type: multi_class_classification
    num_labels: 63
    label_column: intent

  - name: urgency
    task_type: multi_class_classification
    num_labels: 3
    label_column: urgency

training:
  epochs: 5
  batch_size: 16
  learning_rate: 2.0e-5
  output_dir: results/cpims-v2
```

```bash
python -m jenga_ai train --config cpims_config.yaml
```

---

## Usage

### Single mask prediction

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Real Sheng sentence with a single mask
results = fill_mask(f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. Uyo msee aliiba doh zangu most.")
for r in results:
    print(f"{r['token_str']:<20} {r['score']:.3f}")
```

### Multiple masks (one position at a time)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Multiple [MASK] tokens: the pipeline returns a list of lists, one per mask position
results = fill_mask(
    f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. "
    f"Uyo msee ameniibia {tokenizer.mask_token} zangu mingi sana nikimpata "
    f"{tokenizer.mask_token} sana, hadi atawacha kunibeba ufala."
)

for mask_predictions in results:
    print("--- New Mask ---")
    for r in mask_predictions:
        print(f"{r['token_str']:<20} {r['score']:.3f}")
```

### As a base model for fine-tuning (jenga_ai SDK)

```yaml
…
```

## Author

**Rogendo** – built as part of the JengaAI CPIMS NLP pipeline for Kenyan child-protection support systems.