Update README.md
README.md (changed)
tags:
- cpims
- african-nlp
- afriberta
base_model: castorini/afriberta_large
license: apache-2.0
metrics:
- perplexity
library_name: transformers
---

# AfriBERT Kenya – Domain-Adapted Language Model
…

---

## Use Cases & Practical Domains

This model is designed for any NLP task involving Kenyan-language text. It provides a stronger starting point than a generic multilingual model wherever the input contains Swahili, Sheng, code-switching, or Kenyan institutional vocabulary.

### 1. Child Protection & Social Work (CPIMS)

The primary motivation for this model. Kenya's Child Protection Information Management System (CPIMS) generates a high volume of support requests, case notes, and field reports written by social workers, case managers, and NGO staff, often in a mix of English, Swahili, and Sheng.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Help-desk intent classification** | Route incoming support messages to the correct team or knowledge-base article | *"Siwezi kuingia system, password yangu imekwisha"* → `PasswordReset` |
| **Urgency triage** | Flag messages that need immediate human escalation (child at risk, abuse, missing child) | *"Mtoto amekimbia safe house usiku huu"* → `urgent` |
| **Case note sentiment** | Detect frustration or distress in field-worker messages to trigger supervisor review | *"Nimejaribu mara nyingi kupata msaada lakini hakuna anayejibu"* → `negative` |
| **Entity extraction (NER)** | Extract names, locations, case IDs, and child ages from free-text case notes | *"Amina, miaka 9, Kibera, Case ID CP-2024-0471"* |
| **Automated case routing** | Predict which department or OVC program a case should be assigned to | Based on case note text |
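
The urgency-triage row only becomes useful once classifier scores are turned into an action. A minimal sketch of that post-processing step; the fine-tuned classifier is replaced by a keyword stub so the snippet is self-contained, and the label set, keywords, and threshold are all hypothetical:

```python
# Sketch: turning urgency-classifier output into an escalation decision.
# In production, `classify` would be a fine-tuned text-classification head
# on top of this model; the stub below only imitates its output format.

URGENT_LABELS = {"urgent"}       # hypothetical label set: {urgent, routine}
ESCALATION_THRESHOLD = 0.35      # hypothetical confidence floor for escalation

def stub_classifier(text: str) -> dict:
    """Stand-in for a fine-tuned pipeline("text-classification", ...)."""
    keywords = ("amekimbia", "amepotea", "hatari")
    hit = any(k in text.lower() for k in keywords)
    return {"label": "urgent" if hit else "routine", "score": 0.9 if hit else 0.8}

def triage(text: str, classify=stub_classifier) -> str:
    """Map a (label, score) prediction onto a routing decision."""
    pred = classify(text)
    if pred["label"] in URGENT_LABELS and pred["score"] >= ESCALATION_THRESHOLD:
        return "escalate_to_human"
    return "standard_queue"

print(triage("Mtoto amekimbia safe house usiku huu"))   # escalate_to_human
print(triage("Nataka kubadilisha password yangu"))      # standard_queue
```

The classifier is injected as a parameter so the same routing logic can be unit-tested without loading a model.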

---

### 2. Financial Services & M-PESA

M-PESA is Kenya's dominant mobile-money platform, used by over 30 million Kenyans. Customer support queries, fraud reports, and transaction disputes are frequently written in Swahili or code-switched language that generic models mishandle.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Transaction dispute classification** | Categorise dispute type: wrong number, reversal, Fuliza, till payment, paybill | *"Nilituma pesa nambari mbaya, naomba reverse"* |
| **Fraud signal detection** | Detect social-engineering scripts, phishing attempts, SIM-swap language | *"Uko na nambari ya siri ya M-PESA? Niambie utatumia"* |
| **Customer sentiment analysis** | Measure customer satisfaction from M-PESA helpline transcripts | Post-interaction classification |
| **FAQ intent matching** | Match a customer query to the nearest self-service FAQ answer | Semantic similarity over a FAQ corpus |
| **Agent response quality scoring** | Score whether a customer service agent's response was appropriate | Given query + response pairs |
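
The FAQ intent-matching row amounts to sentence embeddings plus cosine similarity. A minimal sketch of that matching logic, assuming mean-pooled encoder outputs; the 4-dimensional vectors and FAQ ids below are toy stand-ins for real embeddings from this model:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def best_faq(query_vec, faq_vecs, faq_ids):
    """Return the FAQ id whose embedding is most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    f = faq_vecs / np.linalg.norm(faq_vecs, axis=1, keepdims=True)
    return faq_ids[int(np.argmax(f @ q))]

# Toy stand-ins: 3 FAQ entries embedded in a 4-d space
faq_ids = ["reversal", "fuliza_limit", "pin_reset"]
faq_vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

# Query: 2 real tokens + 1 padding token, pooled into one sentence vector
token_embs = np.array([[0.8, 0.2, 0.0, 0.0],
                       [1.0, 0.0, 0.0, 0.0],
                       [9.9, 9.9, 9.9, 9.9]])   # padding row, masked out
query = mean_pool(token_embs, np.array([1, 1, 0]))
print(best_faq(query, faq_vecs, faq_ids))        # reversal
```

In a real deployment the FAQ matrix would be precomputed once from the FAQ corpus and only the query would be embedded at request time.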

---

### 3. Healthcare & Community Health Workers (CHWs)

Community Health Workers in Kenya file visit reports and referral notes, often verbally transcribed or typed on low-end phones in mixed Swahili/English.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Symptom extraction** | Extract reported symptoms from CHW visit notes | *"Mtoto ana homa kali na kukohoa sana tangu jana"* |
| **Referral urgency classification** | Triage referral notes: emergency, routine, follow-up | *"Mama mjamzito ana maumivu makali, nahitaji ambulance sasa"* → `emergency` |
| **Facility routing** | Predict whether a patient should go to a dispensary, health centre, or county hospital | Based on symptom description |
| **Health campaign text classification** | Classify community feedback on health campaigns (vaccination, family planning) | SMS/WhatsApp response categorisation |
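
Symptom extraction via token classification ends with BIO tags that still need to be grouped into entity spans. A model-independent sketch of that decoding step; the tokens and tags below are a hypothetical output for the example note above:

```python
# Sketch: decoding BIO tags from a token-classification head into spans.
# A fine-tuned AutoModelForTokenClassification on this model would emit
# tags like these; the decoding itself needs no model.

def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Mtoto", "ana", "homa", "kali", "na", "kukohoa", "sana"]
tags   = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM", "O"]
print(bio_to_spans(tokens, tags))  # [('SYMPTOM', 'homa kali'), ('SYMPTOM', 'kukohoa')]
```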

---

### 4. Education & EdTech

Kenya's education sector uses a blend of English instruction and Swahili explanation, especially in the lower grades. Many EdTech platforms serving rural Kenya receive student questions in Sheng or code-switched text.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Student question topic classification** | Route a question to the right subject tutor or resource | *"Sijui kusolve equation hii, pia sina calculator"* |
| **Learner frustration detection** | Flag messages indicating confusion or disengagement | *"Sielewi hata kidogo, imefail mara tatu"* |
| **Automatic feedback categorisation** | Classify teacher or parent feedback on school platforms | SMS / app reviews |
| **Readability scoring** | Score educational content for appropriateness at different grade levels | Given a paragraph of Swahili text |

---

### 5. Government & Civic Services

Kenya's e-citizen platforms, county service desks, and public feedback systems receive queries and complaints in everyday Kenyan language.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Service request classification** | Route citizen petitions and complaints to the correct county department | *"Barabara ya kwetu ina mashimo makubwa sana, lini mtarekebisha?"* |
| **Complaint sentiment & severity** | Detect strongly negative or potentially viral citizen complaints | Social media monitoring |
| **Language identification** | Detect whether a message is Swahili, Sheng, English, or code-switched | Pre-routing in multi-language systems |
| **Policy document Q&A** | Answer questions grounded in Swahili government policy documents | Retrieval-augmented generation (RAG) with this encoder |

---

### 6. Media, Social Listening & Misinformation

Twitter/X, Facebook, and WhatsApp in Kenya carry a large volume of Sheng and code-switched content that standard multilingual models struggle to classify.

**Practical tasks:**

| Task | Description | Example input |
|---|---|---|
| **Hate speech / harmful content detection** | Detect Sheng-coded hate speech or incitement that generic models miss | Election-period social media moderation |
| **Rumour / misinformation flagging** | Classify claims as verified, unverified, or disputed | WhatsApp forward monitoring |
| **Topic classification** | Assign news articles or social posts to categories (politics, economy, sports, health) | Media monitoring dashboards |
| **Sentiment analysis** | Measure public sentiment on policy announcements, brands, or events | Code-switched Twitter/X data |

---

## Fine-tuning Guide

This model can be fine-tuned with as few as **200–500 labelled examples** per class for most classification tasks, because domain-adaptive pretraining (DAPT) has already adapted the internal representations to the target domain.

### Recommended fine-tuning tasks by architecture

| Architecture | Suitable for | HuggingFace class |
|---|---|---|
| Sequence classification | Intent, sentiment, urgency, topic, routing | `AutoModelForSequenceClassification` |
| Token classification | NER (names, locations, case IDs, symptoms) | `AutoModelForTokenClassification` |
| Multi-task (shared encoder + multiple heads) | Intent + urgency simultaneously | Custom (see jenga_ai SDK) |
| Question answering | Policy/FAQ grounding | `AutoModelForQuestionAnswering` |
| Sentence similarity | Semantic search, FAQ matching | Add a pooling head + contrastive loss |
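
The multi-task row (shared encoder + multiple heads) reduces to two linear heads reading one pooled vector. A forward-pass sketch with random stand-in weights; the 768-dimensional hidden size is an assumption (check the model config), and the jenga_ai SDK's multi-task config, shown later, trains roughly this structure end-to-end:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768                       # assumed hidden size; check the model config

def softmax(z):
    """Numerically stable softmax over a 1-d logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One shared sentence representation ([CLS] or mean-pooled encoder output)
pooled = rng.standard_normal(HIDDEN)

# Two task heads over the same vector: 63 intents, 3 urgency levels
intent_head = rng.standard_normal((63, HIDDEN)) * 0.02
urgency_head = rng.standard_normal((3, HIDDEN)) * 0.02

intent_probs = softmax(intent_head @ pooled)
urgency_probs = softmax(urgency_head @ pooled)

print(intent_probs.shape, urgency_probs.shape)  # (63,) (3,)
```

During training, the two cross-entropy losses are summed (optionally weighted) and backpropagated through the shared encoder.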

### Minimum data guidelines

| Task complexity | Approx. labelled examples needed |
|---|---|
| Binary classification (2 classes) | 100–300 per class |
| Multi-class (5–15 classes) | 150–400 per class |
| Multi-class (15–63 classes) | 200–500 per class |
| NER (token-level) | 500–1,000 sentences with full annotation |
| Multi-task (2 heads) | Same as above per task head |

*These estimates are based on domain-adapted models. A generic multilingual base model would need 3–5× more data to reach equivalent performance on Kenyan text.*

### Fine-tuning with HuggingFace Trainer

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

model_name = "Rogendo/afribert-kenya-adapted"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # e.g. urgency: low / medium / high
)

training_args = TrainingArguments(
    output_dir="my-kenya-classifier",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # standard fine-tuning LR
    warmup_ratio=0.1,
    eval_strategy="epoch",  # "evaluation_strategy" in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    bf16=True,  # use bf16 on A100/A40/H100
)

# train_dataset / eval_dataset: tokenized datasets with a "labels" column
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,  # "tokenizer=" in transformers < 4.46
)
trainer.train()
```
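
By default the Trainer reports only loss at evaluation time; a `compute_metrics` callback adds task metrics. A minimal sketch in plain numpy (in practice the `evaluate` library is more common), passed to the Trainer as `compute_metrics=compute_metrics`:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy and macro-F1 from (logits, labels), as the Trainer provides them."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = float((preds == labels).mean())
    f1s = []
    for c in np.unique(labels):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1s))}

# Toy check: 4 examples, 3 classes
logits = np.array([[2.0, 0.1, 0.0],
                   [0.0, 3.0, 0.2],
                   [0.1, 0.0, 1.5],
                   [1.2, 1.0, 0.0]])
labels = np.array([0, 1, 2, 1])
print(compute_metrics((logits, labels)))
```

Macro-F1 is the better headline number here, since urgency and intent labels are typically imbalanced.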

### Fine-tuning with jenga_ai SDK (multi-task)

```yaml
# cpims_config.yaml
model:
  base_model: Rogendo/afribert-kenya-adapted
  max_seq_len: 128

tasks:
  - name: intent
    task_type: multi_class_classification
    num_labels: 63
    label_column: intent

  - name: urgency
    task_type: multi_class_classification
    num_labels: 3
    label_column: urgency

training:
  epochs: 5
  batch_size: 16
  learning_rate: 2.0e-5
  output_dir: results/cpims-v2
```

```bash
python -m jenga_ai train --config cpims_config.yaml
```

---

## Usage

### Single mask prediction

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Real Sheng sentence with a single mask
results = fill_mask(f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. Uyo msee aliiba doh zangu most.")
for r in results:
    print(f"{r['token_str']:<20} {r['score']:.3f}")
```

### Multiple masks (one position at a time)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Rogendo/afribert-kenya-adapted")
model = AutoModelForMaskedLM.from_pretrained("Rogendo/afribert-kenya-adapted")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Multiple [MASK] tokens: the pipeline returns a list of lists, one per mask position
results = fill_mask(
    f"Oya, twendeni zetu, kuna {tokenizer.mask_token} flani ameniudhi. "
    f"Uyo msee ameniibia {tokenizer.mask_token} zangu mingi sana nikimpata "
    f"{tokenizer.mask_token} sana, hadi atawacha kunibeba ufala."
)

for mask_predictions in results:
    print("--- New Mask ---")
    for r in mask_predictions:
        print(f"{r['token_str']:<20} {r['score']:.3f}")
```

### As a base model for fine-tuning (jenga_ai SDK)

```yaml
…
```

## Author

**Rogendo** – built as part of the JengaAI CPIMS NLP pipeline for Kenyan child-protection support systems.