Rogendo committed on
Commit ec0fbce · verified · 1 Parent(s): 6793d95

Update README.md

Files changed (1):
  1. README.md +224 -3

README.md CHANGED
@@ -11,12 +11,13 @@ tags:
 - cpims
 - african-nlp
 - afriberta
+
 base_model: castorini/afriberta_large
 license: apache-2.0
 metrics:
 - perplexity
 library_name: transformers
 ---

 # AfriBERT Kenya — Domain-Adapted Language Model

@@ -230,21 +233,239 @@ Compared to the previous version trained on `distilbert-base-uncased` with 271 r
 
 

 ---

+ ## Use Cases & Practical Domains
+
+ This model is designed for any NLP task involving Kenyan-language text. It provides a stronger starting point than a generic multilingual model wherever the input contains Swahili, Sheng, code-switching, or Kenyan institutional vocabulary.
+
+ ### 1. Child Protection & Social Work (CPIMS)
+
+ This is the model's primary motivation. Kenya's Child Protection Information Management System (CPIMS) generates a high volume of support requests, case notes, and field reports written by social workers, case managers, and NGO staff — often in a mix of English, Swahili, and Sheng.
+
+ **Practical tasks:**
+
+ | Task | Description | Example input |
+ |---|---|---|
+ | **Help-desk intent classification** | Route incoming support messages to the correct team or knowledge-base article | *"Siwezi kuingia system, password yangu imekwisha"* → `PasswordReset` |
+ | **Urgency triage** | Flag messages that need immediate human escalation (child at risk, abuse, missing child) | *"Mtoto amekimbia safe house usiku huu"* → `urgent` |
+ | **Case note sentiment** | Detect frustration or distress in field-worker messages to trigger supervisor review | *"Nimejaribu mara nyingi kupata msaada lakini hakuna anayejibu"* → `negative` |
+ | **Entity extraction (NER)** | Extract names, locations, case IDs, and child ages from free-text case notes | *"Amina, miaka 9, Kibera, Case ID CP-2024-0471"* |
+ | **Automated case routing** | Predict which department or OVC program a case should be assigned to | Based on case-note text |
+
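The triage and routing rows above assume a downstream step that turns classifier outputs into actions. A minimal sketch of that step, in which the labels, confidence threshold, and queue names are illustrative assumptions, not CPIMS values or part of this model:

```python
# Illustrative post-processing for an urgency/intent classifier
# fine-tuned from this model. Labels, threshold, and queue names
# are assumptions for the sketch.
ESCALATION_THRESHOLD = 0.80

ROUTING = {
    "urgent": "duty-officer",        # immediate human escalation
    "routine": "case-worker-queue",
    "info": "knowledge-base",
}

def route(prediction: dict) -> str:
    """Map one {label, score} prediction to a destination queue."""
    label, score = prediction["label"], prediction["score"]
    if score < ESCALATION_THRESHOLD:
        # Low-confidence predictions always go to a human reviewer.
        return "human-review"
    return ROUTING.get(label, "human-review")

print(route({"label": "urgent", "score": 0.97}))   # duty-officer
print(route({"label": "routine", "score": 0.55}))  # human-review
```

The dicts mimic the shape a `text-classification` pipeline returns, so this function could consume a fine-tuned checkpoint's output directly.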
+ ---
+
+ ### 2. Financial Services & M-PESA
+
+ M-PESA is Kenya's dominant mobile-money platform, used by over 30 million Kenyans. Customer support queries, fraud reports, and transaction disputes are frequently written in Swahili or code-switched language that generic models mishandle.
+
+ **Practical tasks:**
+
+ | Task | Description | Example input |
+ |---|---|---|
+ | **Transaction dispute classification** | Categorise dispute type: wrong number, reversal, Fuliza, till payment, paybill | *"Nilituma pesa nambari mbaya, naomba reverse"* |
+ | **Fraud signal detection** | Detect social-engineering scripts, phishing attempts, SIM-swap language | *"Uko na nambari ya siri ya M-PESA? Niambie utatumia"* |
+ | **Customer sentiment analysis** | Measure customer satisfaction from M-PESA helpline transcripts | Post-interaction classification |
+ | **FAQ intent matching** | Match a customer query to the nearest self-service FAQ answer | Semantic similarity over a FAQ corpus |
+ | **Agent response quality scoring** | Score whether a customer-service agent's response was appropriate | Given query + response pairs |
+
+ ---
+
+ ### 3. Healthcare & Community Health Workers (CHWs)
+
+ Community Health Workers in Kenya file visit reports and referral notes, often transcribed from speech or typed on low-end phones in mixed Swahili/English.
+
+ **Practical tasks:**
+
+ | Task | Description | Example input |
+ |---|---|---|
+ | **Symptom extraction** | Extract reported symptoms from CHW visit notes | *"Mtoto ana homa kali na kukohoa sana tangu jana"* |
+ | **Referral urgency classification** | Triage referral notes: emergency, routine, follow-up | *"Mama mjamzito ana maumivu makali, nahitaji ambulance sasa"* → `emergency` |
+ | **Facility routing** | Predict whether a patient should go to a dispensary, health centre, or county hospital | Based on symptom description |
+ | **Health campaign text classification** | Classify community feedback on health campaigns (vaccination, family planning) | SMS/WhatsApp response categorisation |
+
+ ---
+
+ ### 4. Education & EdTech
+
+ Kenya's education sector uses a blend of English instruction and Swahili explanation, especially in lower grades. Many EdTech platforms serving rural Kenya receive student questions in Sheng or code-switched text.
+
+ **Practical tasks:**
+
+ | Task | Description | Example input |
+ |---|---|---|
+ | **Student question topic classification** | Route a question to the right subject tutor or resource | *"Sijui kusolve equation hii, pia sina calculator"* |
+ | **Learner frustration detection** | Flag messages indicating confusion or disengagement | *"Sielewi hata kidogo, imefail mara tatu"* |
+ | **Automatic feedback categorisation** | Classify teacher or parent feedback on school platforms | SMS / app reviews |
+ | **Readability scoring** | Score educational content for appropriateness at different grade levels | Given a paragraph of Swahili text |
+
+ ---
+
+ ### 5. Government & Civic Services
+
+ Kenya's eCitizen platform, county service desks, and public feedback systems receive queries and complaints in everyday Kenyan language.
+
+ **Practical tasks:**
+
+ | Task | Description | Example input |
+ |---|---|---|
+ | **Service request classification** | Route citizen petitions and complaints to the correct county department | *"Barabara ya kwetu ina mashimo makubwa sana, lini mtarekebisha?"* |
+ | **Complaint sentiment & severity** | Detect strongly negative or potentially viral citizen complaints | Social media monitoring |
+ | **Language identification** | Detect whether a message is Swahili, Sheng, English, or code-switched | Pre-routing in multi-language systems |
+ | **Policy document Q&A** | Answer questions grounded in Swahili government policy documents | Retrieval-augmented generation (RAG) with this encoder |
+
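For the language-identification row, a trivially simple baseline can pre-route messages before any model is involved. A stdlib-only sketch, where the tiny word lists are illustrative assumptions and a classifier fine-tuned from this model would be the real solution:

```python
# Naive keyword-overlap language ID for pre-routing only.
# The word lists are tiny illustrative assumptions; a fine-tuned
# classifier would replace this heuristic in practice.
SWAHILI = {"ya", "na", "kwa", "sana", "lakini", "mimi", "wewe"}
ENGLISH = {"the", "and", "for", "very", "but", "i", "you"}

def guess_language(text: str) -> str:
    words = set(text.lower().split())
    sw, en = len(words & SWAHILI), len(words & ENGLISH)
    if sw and en:
        return "code-switched"
    if sw:
        return "swahili"
    if en:
        return "english"
    return "unknown"

print(guess_language("Barabara ya kwetu ina mashimo makubwa sana"))  # swahili
```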
+ ---
+
+ ### 6. Media, Social Listening & Misinformation
+
+ Twitter/X, Facebook, and WhatsApp in Kenya carry a large volume of Kenyan Sheng and code-switched content that standard multilingual models struggle to classify.
+
+ **Practical tasks:**
+
+ | Task | Description | Example input |
+ |---|---|---|
+ | **Hate speech / harmful content detection** | Detect Sheng-coded hate speech or incitement that generic models miss | Election-period social media moderation |
+ | **Rumour / misinformation flagging** | Classify claims as verified, unverified, or disputed | WhatsApp forward monitoring |
+ | **Topic classification** | Assign news articles or social posts to categories (politics, economy, sports, health) | Media monitoring dashboards |
+ | **Sentiment analysis** | Measure public sentiment on policy announcements, brands, or events | Code-switched Twitter/X data |
+
+ ---
+
+ ## Fine-tuning Guide
+
+ This model can be fine-tuned with as few as **200–500 labelled examples** per class for most classification tasks, because domain-adaptive pretraining (DAPT) has already adapted the internal representations to the target domain.
+
+ ### Recommended fine-tuning tasks by architecture
+
+ | Architecture | Suitable for | HuggingFace class |
+ |---|---|---|
+ | Sequence classification | Intent, sentiment, urgency, topic, routing | `AutoModelForSequenceClassification` |
+ | Token classification | NER (names, locations, case IDs, symptoms) | `AutoModelForTokenClassification` |
+ | Multi-task (shared encoder + multiple heads) | Intent + urgency simultaneously | Custom (see jenga_ai SDK) |
+ | Question answering | Policy/FAQ grounding | `AutoModelForQuestionAnswering` |
+ | Sentence similarity | Semantic search, FAQ matching | Add a pooling head + contrastive loss |
+
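The pooling head mentioned in the last row is usually just masked mean pooling over the encoder's token states, followed by cosine similarity. A stdlib sketch over toy 3-dimensional vectors standing in for real hidden states:

```python
import math

# Mean-pool token vectors into one sentence vector, ignoring padding
# positions, then compare sentences by cosine similarity. The toy
# vectors below stand in for the encoder's hidden states.
def mean_pool(token_vectors, attention_mask):
    dim = len(token_vectors[0])
    n = sum(attention_mask)  # number of real (non-padding) tokens
    pooled = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i in range(dim):
                pooled[i] += vec[i] / n
    return pooled

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two toy "sentences": three token vectors each; the second sentence's
# last token is padding and is masked out of the pooled vector.
s1 = mean_pool([[1, 0, 0], [0, 1, 0], [1, 1, 0]], [1, 1, 1])
s2 = mean_pool([[1, 0, 0], [1, 1, 0], [9, 9, 9]], [1, 1, 0])
print(round(cosine(s1, s2), 3))  # 0.949
```

With real encoder outputs, `token_vectors` would be the last hidden state and `attention_mask` the tokenizer's mask; a contrastive loss over such pooled vectors is what turns the encoder into a sentence-similarity model.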
+ ### Minimum data guidelines
+
+ | Task complexity | Approx. labelled examples needed |
+ |---|---|
+ | Binary classification (2 classes) | 100–300 per class |
+ | Multi-class (5–15 classes) | 150–400 per class |
+ | Multi-class (15–63 classes) | 200–500 per class |
+ | NER (token-level) | 500–1,000 sentences with full annotation |
+ | Multi-task (2 heads) | Same as above per task head |
+
+ *These estimates are based on domain-adapted models. A generic multilingual base model would need 3–5× more data to reach equivalent performance on Kenyan text.*
+
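Labelled sets of this size are often heavily imbalanced (far fewer `urgent` notes than `routine` ones). One standard mitigation, shown here as a stdlib sketch with made-up counts, is inverse-frequency class weighting; the resulting weights would typically be passed to the loss function (e.g. `torch.nn.CrossEntropyLoss(weight=...)`):

```python
from collections import Counter

# Inverse-frequency class weights for an imbalanced label set.
# The label counts are made up for illustration.
def class_weights(labels):
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # weight_c = total / (n_classes * count_c): rarer classes weigh more
    return {c: total / (n_classes * n) for c, n in counts.items()}

labels = ["routine"] * 80 + ["urgent"] * 15 + ["info"] * 5
print(class_weights(labels))
```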
+ ### Fine-tuning with HuggingFace Trainer
+
+ ```python
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForSequenceClassification,
+     TrainingArguments,
+     Trainer,
+ )
+
+ model_name = "Rogendo/afribert-kenya-adapted"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(
+     model_name, num_labels=3  # e.g. urgency: low / medium / high
+ )
+
+ training_args = TrainingArguments(
+     output_dir="my-kenya-classifier",
+     num_train_epochs=5,
+     per_device_train_batch_size=16,
+     learning_rate=2e-5,  # standard fine-tuning LR
+     warmup_ratio=0.1,
+     eval_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,
+     bf16=True,  # use bf16 on A100/A40/H100
+ )
+
+ # train_dataset / eval_dataset: pre-tokenized datasets prepared beforehand
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     eval_dataset=eval_dataset,
+     processing_class=tokenizer,
+ )
+ trainer.train()
+ ```
+
+ ### Fine-tuning with jenga_ai SDK (multi-task)
+
+ ```yaml
+ # cpims_config.yaml
+ model:
+   base_model: Rogendo/afribert-kenya-adapted
+   max_seq_len: 128
+
+ tasks:
+   - name: intent
+     task_type: multi_class_classification
+     num_labels: 63
+     label_column: intent
+
+   - name: urgency
+     task_type: multi_class_classification
+     num_labels: 3
+     label_column: urgency
+
+ training:
+   epochs: 5
+   batch_size: 16
+   learning_rate: 2.0e-5
+   output_dir: results/cpims-v2
+ ```
+
+ ```bash
+ python -m jenga_ai train --config cpims_config.yaml
+ ```
+
+ ---
+
 ## Usage

+ ### Single mask prediction
+
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

 tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
 model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")

- # Masked token prediction
 fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
- results = fill_mask("Msee alikuwa poa sana, akanisaidia kupata [MASK] ya ofisi.")
+
+ # Real Sheng sentence — single mask
+ results = fill_mask(f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. Uyo msee aliiba doh zangu most.")
 for r in results:
     print(f"{r['token_str']:<20} {r['score']:.3f}")
 ```


+ ### Multiple masks (one position at a time)
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
+
+ tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
+ model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")
+
+ fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+
+ # Multiple [MASK] tokens — pipeline returns a list of lists, one per mask position
+ results = fill_mask(
+     f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. "
+     f"Uyo msee ameniibia {tokenizer.mask_token} zangu mingi sana nikimpata "
+     f"{tokenizer.mask_token} sana, hadi atawacha kunibeba ufala."
+ )
+
+ for mask_predictions in results:
+     print("--- New Mask ---")
+     for r in mask_predictions:
+         print(f"{r['token_str']:<20} {r['score']:.3f}")
+ ```
+
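The card lists perplexity as its evaluation metric. For a masked LM it is typically computed as the exponential of the average negative log-likelihood over held-out (masked) tokens; a stdlib sketch with made-up per-token probabilities in place of real model outputs:

```python
import math

# Perplexity = exp(mean negative log-likelihood) over evaluated tokens.
# These per-token probabilities are made-up stand-ins for the model's
# predicted probability of each held-out (masked) token.
token_probs = [0.40, 0.25, 0.10, 0.60]

nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(round(perplexity, 2))  # 3.59
```

Lower is better: a perplexity of 1.0 would mean every held-out token was predicted with certainty, which is why DAPT runs typically report the drop in this number on in-domain text.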
 ### As a base model for fine-tuning (jenga_ai SDK)

 ```yaml
 
@@ -290,4 +511,4 @@ If you use this model, please cite the base model:

 ## Author

- **Rogendo** — built as part of the JengaAI CPIMS NLP pipeline for Kenyan child-protection support systems.
+ **Rogendo** — built as part of the JengaAI CPIMS NLP pipeline for Kenyan child-protection support systems.