---
language:
- en
- sw
tags:
- multi-task-learning
- text-classification
- fraud-detection
- sentiment-analysis
- call-quality
- question-answering
- jenga-ai
- nlp-for-africa
- security
- attention-fusion
base_model: distilbert-base-uncased
license: apache-2.0
pipeline_tag: text-classification
datasets:
- custom
model-index:
- name: JengaAI-multi-task-nlp
  results:
  - task:
      type: text-classification
      name: Fraud Detection
    metrics:
    - type: f1
      value: 1
      name: F1
    - type: accuracy
      value: 1
      name: Accuracy
  - task:
      type: text-classification
      name: Sentiment Analysis
    metrics:
    - type: f1
      value: 0.167
      name: F1
    - type: accuracy
      value: 0.333
      name: Accuracy
  - task:
      type: text-classification
      name: Call Quality - Listening
    metrics:
    - type: f1
      value: 0.922
      name: F1
  - task:
      type: text-classification
      name: Call Quality - Resolution
    metrics:
    - type: f1
      value: 0.908
      name: F1
widget:
- text: >-
    Suspicious M-Pesa transaction detected from unknown account requesting
    urgent transfer
  example_title: Fraud Detection
- text: >-
    The customer service was excellent, my billing issue was resolved on the
    first call
  example_title: Positive Sentiment
- text: Hello, welcome to Safaricom customer care. How can I assist you today?
  example_title: Call Quality Scoring
library_name: transformers
---

# JengaAI Multi-Task NLP (3-Task Attention Fusion)

A **multi-task NLP model** built with the [JengaAI framework](https://github.com/Rogendo/JengaAI) that performs **fraud detection**, **sentiment analysis**, and **call quality scoring** simultaneously through a shared encoder with attention-based task fusion. Designed for Kenyan national security and telecommunications applications.

## Model Capabilities

This model handles **3 tasks** with **8 prediction heads** producing **22 total output dimensions** in a single forward pass:

| Task | Type | Heads | Outputs | Best F1 |
|:-----|:-----|:------|:--------|:--------|
| **Fraud Detection** | Binary classification | 1 (fraud) | 2 classes: normal / fraud | **1.000** |
| **Sentiment Analysis** | 3-class classification | 1 (sentiment) | 3 classes: negative / neutral / positive | 0.167 |
| **Call Quality Scoring** | Multi-label QA | 6 heads, 17 sub-metrics | Binary per sub-metric | **0.647 - 0.967** |

### Call Quality Sub-Metrics (17 Binary Outputs)

The call quality task evaluates customer service transcripts across 6 quality dimensions:

| Head | Sub-Metrics | F1 |
|:-----|:-----------|:---|
| **Opening** | greeting | 0.967 |
| **Listening** | acknowledgment, empathy, clarification, active_listening, patience | 0.922 |
| **Proactiveness** | initiative, follow_up, suggestions | 0.802 |
| **Resolution** | identified_issue, provided_solution, confirmed_resolution, set_expectations, offered_alternatives | 0.908 |
| **Hold** | asked_permission, explained_reason | 0.647 |
| **Closing** | proper_farewell | 0.881 |

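For reporting, the 17 binary outputs can be rolled up into a single per-call score. One possible scheme is sketched below; the per-head weights reuse the training loss weights from the Training Details section, but the rollup itself is an application-level choice, not something the model outputs:

```python
# Hypothetical rollup of binary sub-metric predictions into one agent
# quality score. The per-head weights mirror the training loss weights
# documented under "Task Loss Weights" below.
HEAD_WEIGHTS = {
    "opening": 1.0, "listening": 1.5, "proactiveness": 1.0,
    "resolution": 2.0, "hold": 0.5, "closing": 1.0,
}

def quality_score(heads: dict) -> float:
    """Weighted average of per-head pass rates, in [0, 1]."""
    total = sum(HEAD_WEIGHTS[name] for name in heads)
    score = sum(
        HEAD_WEIGHTS[name] * sum(subs.values()) / len(subs)
        for name, subs in heads.items()
    )
    return score / total

print(quality_score({
    "opening": {"greeting": True},
    "hold": {"asked_permission": False, "explained_reason": False},
    "closing": {"proper_farewell": True},
}))  # 0.8: opening and closing passed, hold (weight 0.5) did not
```
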
## Architecture

```
Input Text
     |
     v
[DistilBERT Encoder] ---- 6 layers, 768 hidden, 12 attention heads
     |
     v
[Attention Fusion] ------- task-conditioned attention with residual connections
     |
     +-- [Task 0: Fraud Head] ----------- Linear(768, 2) --> softmax
     +-- [Task 1: Sentiment Head] ------- Linear(768, 3) --> softmax
     +-- [Task 2: QA Scoring 6 Heads] --- 6x Linear(768, 1..5) --> sigmoid
```

**Key design choices:**

- **Shared encoder**: all 3 tasks share a single DistilBERT encoder, enabling knowledge transfer between fraud patterns, sentiment signals, and call quality indicators
- **Attention fusion**: a learned attention mechanism modulates the shared representation per task, allowing each task to attend to different parts of the encoder output while still benefiting from shared features
- **Residual connections**: the fusion output is added to the original representation through a learned gate (`gate_init_value=0.5`), stabilizing training and letting each task fall back on the base representation
- **Multi-head QA**: call quality uses 6 independent classification heads with different output sizes (1-5 binary outputs each), weighted by importance during training (resolution: 2.0x, listening: 1.5x, hold: 0.5x); a sketch of this forward pass appears below

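To make the diagram concrete, here is a minimal PyTorch sketch of the forward pass. The class and attribute names are illustrative rather than JengaAI's actual internals; only the shapes, the 8-head layout, and the gated residual (initialized at 0.5) follow the description above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class AttentionFusionMultiTask(nn.Module):
    """Illustrative sketch of the shared-encoder design above. Names are
    hypothetical; shapes and head layout follow this model card."""

    def __init__(self, hidden: int = 768, num_tasks: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        # One learned query per task conditions the attention over tokens.
        self.task_queries = nn.Parameter(torch.randn(num_tasks, hidden))
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        # Gated residual, one gate per task, initialized at 0.5.
        self.gates = nn.Parameter(torch.full((num_tasks,), 0.5))
        self.fraud_head = nn.Linear(hidden, 2)      # task 0: 2-way softmax
        self.sentiment_head = nn.Linear(hidden, 3)  # task 1: 3-way softmax
        # Task 2: six QA heads, 1-5 sigmoid outputs each (17 in total).
        qa_sizes = {"opening": 1, "listening": 5, "proactiveness": 3,
                    "resolution": 5, "hold": 2, "closing": 1}
        self.qa_heads = nn.ModuleDict(
            {name: nn.Linear(hidden, n) for name, n in qa_sizes.items()})

    def fuse(self, tokens: torch.Tensor, task: int) -> torch.Tensor:
        """Task-conditioned attention over tokens, blended with CLS."""
        query = self.task_queries[task].view(1, 1, -1).expand(tokens.size(0), -1, -1)
        attended, _ = self.attn(query, tokens, tokens)        # [B, 1, H]
        return tokens[:, 0] + self.gates[task] * attended.squeeze(1)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return {
            "fraud": self.fraud_head(self.fuse(tokens, 0)).softmax(-1),
            "sentiment": self.sentiment_head(self.fuse(tokens, 1)).softmax(-1),
            "call_quality": {name: head(self.fuse(tokens, 2)).sigmoid()
                             for name, head in self.qa_heads.items()},
        }
```
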
## Usage

### With JengaAI Framework (Recommended)

```bash
pip install torch transformers pydantic pyyaml huggingface_hub
```

```python
from huggingface_hub import snapshot_download
from jenga_ai.inference import InferencePipeline

# Download model
model_path = snapshot_download(
    "Rogendo/JengaAI-multi-task-nlp",
    ignore_patterns=["checkpoints/*", "logs/*"],
)

# Load pipeline
pipeline = InferencePipeline.from_checkpoint(
    model_dir=model_path,
    config_path=f"{model_path}/experiment_config.yaml",
    device="auto",
)

# Run all 3 tasks at once
result = pipeline.predict("Suspicious M-Pesa transaction from unknown account")
print(result.to_json())

# Or run a single task
fraud_result = pipeline.predict(
    "WARNING: Your Safaricom account has been compromised. Send 5000 KES to unlock.",
    task_name="fraud_detection",
)
fraud = fraud_result.task_results["fraud_detection"].heads["fraud"]
print(f"Fraud: {fraud.prediction} (confidence: {fraud.confidence:.1%})")
# Fraud: 1 (confidence: 96.9%)
```

### Batch Inference

```python
texts = [
    "Suspicious M-Pesa notification asking me to send money.",
    "Normal airtime top-up of 100 KES via M-Pesa.",
    "WARNING: Your account has been compromised.",
]

results = pipeline.predict_batch(texts, task_name="fraud_detection", batch_size=32)

for text, result in zip(texts, results):
    fraud = result.task_results["fraud_detection"].heads["fraud"]
    label = "FRAUD" if fraud.prediction == 1 else "LEGIT"
    print(f"[{label} {fraud.confidence:.1%}] {text}")
```

### CLI

```bash
# Single text
python -m jenga_ai predict \
    --config experiment_config.yaml \
    --model-dir ./model \
    --text "Suspicious M-Pesa transaction from unknown account" \
    --format report

# Batch from file
python -m jenga_ai predict \
    --config experiment_config.yaml \
    --model-dir ./model \
    --input-file transcripts.jsonl \
    --output predictions.json \
    --batch-size 16
```

### Call Quality Scoring Example

```python
result = pipeline.predict(
    "Hello, welcome to Safaricom customer care. I understand you're having "
    "a billing issue. Let me look into that for you right away. I've found "
    "the discrepancy and corrected your balance. Is there anything else?",
    task_name="call_quality",
)

for head_name, head in result.task_results["call_quality"].heads.items():
    print(f"{head_name:16s} {head.prediction} (conf: {head.confidence:.2f})")
```

Output:
```
opening          {'greeting': True} (conf: 0.82)
listening        {'acknowledgment': True, 'empathy': True, ...} (conf: 0.75)
proactiveness    {'initiative': True, 'follow_up': True, 'suggestions': False} (conf: 0.58)
resolution       {'identified_issue': True, 'provided_solution': True, ...} (conf: 0.69)
hold             {'asked_permission': False, 'explained_reason': False} (conf: 0.02)
closing          {'proper_farewell': True} (conf: 0.52)
```

### Low-Level Usage (Without JengaAI Framework)

If you only need the raw model weights and want to integrate them into your own pipeline, you can drive the encoder and a task head directly. Note that this path skips the attention-fusion layer, so head probabilities may differ slightly from the full pipeline's output:

```python
import torch
import json
from transformers import AutoTokenizer, AutoModel, AutoConfig

# Load components
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder_config = AutoConfig.from_pretrained("./model/encoder_config")

with open("./model/metadata.json") as f:
    metadata = json.load(f)

# Load full state dict
state_dict = torch.load("./model/model.pt", map_location="cpu", weights_only=True)

# Extract encoder weights (keys starting with "encoder.") and strip the prefix
encoder_state = {k[len("encoder."):]: v for k, v in state_dict.items()
                 if k.startswith("encoder.")}
encoder = AutoModel.from_config(encoder_config)
encoder.load_state_dict(encoder_state)
encoder.eval()

# Run encoder
inputs = tokenizer("Suspicious transaction", return_tensors="pt", padding="max_length",
                   truncation=True, max_length=256)
with torch.no_grad():
    outputs = encoder(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0]  # [1, 768]

# Extract fraud head weights (task 0, head "fraud")
fraud_weight = state_dict["tasks.0.heads.fraud.1.weight"]  # [2, 768]
fraud_bias = state_dict["tasks.0.heads.fraud.1.bias"]      # [2]

logits = cls_embedding @ fraud_weight.T + fraud_bias
probs = torch.softmax(logits, dim=-1)
print(f"Fraud probability: {probs[0, 1].item():.4f}")
```

## Intended Use

### Primary Use Cases

- **M-Pesa Fraud Detection**: Classify M-Pesa transaction descriptions as fraudulent or legitimate. Designed for Safaricom and Kenyan mobile money contexts.
- **Customer Sentiment Monitoring**: Analyze customer feedback and communications for sentiment polarity (negative / neutral / positive).
- **Call Center Quality Assurance**: Score customer service call transcripts across 17 quality sub-metrics in 6 categories, replacing manual QA audits.
- **Multi-Signal Analysis**: Run all 3 tasks simultaneously on the same text to get a comprehensive analysis (is this a fraud attempt? what is the sentiment? how good was the agent's response?).

### Intended Users

- Kenyan telecommunications companies (Safaricom, Airtel Kenya)
- Financial institutions monitoring mobile money transactions
- Call center operations teams performing quality audits
- Security analysts processing incident reports
- NLP researchers working on African language and context models

### Downstream Use

The model can be integrated into:

- Real-time fraud alerting systems
- Call center dashboards with automated QA scoring
- Customer feedback analysis pipelines
- Security operations center (SOC) threat triage workflows
- Mobile money transaction monitoring platforms

## Out-of-Scope Use

- **Not for automated decision-making without human oversight.** This model should support human analysts, not replace them. High-stakes fraud decisions require human review.
- **Not for non-Kenyan contexts without retraining.** Entity names, transaction patterns, and call center norms are Kenyan-specific.
- **Not for languages other than English.** While some Swahili terms appear in the training data (M-Pesa, Safaricom, KRA), the model is primarily English.
- **Not for legal evidence.** Model outputs are analytical signals, not forensic evidence.
- **Not for surveillance of individuals.** The model analyzes text content, not identity.

## Bias, Risks, and Limitations

### Known Biases

- **Training data imbalance**: Fraud detection was trained on only 20 samples (16 train / 4 eval). The model achieves 1.0 F1 on eval, but this is likely an artifact of the tiny eval set and potential overfitting. Real-world fraud patterns are far more diverse.
- **Sentiment data**: Only 15 samples, with accuracy stuck at 33.3% (the random baseline for 3 classes). The sentiment head needs significantly more training data to be production-useful.
- **Call quality data**: 4,996 synthetic transcripts. While metrics are strong (0.65-0.97 F1), the synthetic nature means real-world transcripts with noise, code-switching (Swahili-English), and non-standard grammar may perform differently.
- **Geographic bias**: All training data reflects Kenyan contexts. The model may not generalize to other East African countries without adaptation.

### Risks

- **False positives in fraud detection**: Legitimate transactions flagged as fraud can block real users. Always pair this model with human review for enforcement actions.
- **False negatives in fraud detection**: Sophisticated fraud patterns absent from the training data will be missed. This model is one signal among many, not a standalone detector.
- **Over-reliance on QA scores**: Call quality scores should augment, not replace, human QA reviewers. Edge cases (cultural nuances, sarcasm, escalation scenarios) may be scored incorrectly.

### Recommendations

- Use fraud detection as a **triage signal** (flag for review), not an automatic block
- Retrain with production-scale data before deploying to production
- Monitor prediction confidence and route low-confidence predictions to human review using the built-in HITL routing (`enable_hitl=True`)
- Enable PII redaction (`enable_pii=True`) when processing real customer data
- Enable audit logging (`enable_audit=True`) for compliance and accountability

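Putting the last three recommendations together, the safeguards might be wired up as shown below. This is a sketch only: whether these flags belong to `from_checkpoint()` or a separate configuration object may differ by framework version, so check the JengaAI docs for the exact entry point.

```python
# Hypothetical wiring of the three safeguards above; the flag names come
# from this model card, but the exact signature is an assumption.
pipeline = InferencePipeline.from_checkpoint(
    model_dir=model_path,
    config_path=f"{model_path}/experiment_config.yaml",
    device="auto",
    enable_pii=True,    # redact phone numbers, IDs, M-Pesa transaction IDs
    enable_hitl=True,   # route low-confidence predictions to human review
    enable_audit=True,  # tamper-evident log of every inference call
)
```
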
## Training Details

### Training Data

| Dataset | Task | Samples | Source |
|:--------|:-----|:--------|:-------|
| `sample_classification.jsonl` | Fraud Detection | 20 | Synthetic M-Pesa transaction descriptions |
| `sample_sentiment.jsonl` | Sentiment Analysis | 15 | Synthetic customer feedback |
| `synthetic_qa_metrics_data_v01x.json` | Call Quality | 4,996 | Synthetic call center transcripts with 17 binary QA labels |

**Train/eval split**: 80/20 random split (seed=42)

All datasets are synthetic, generated to reflect linguistic patterns in Kenyan telecommunications and financial services contexts. They contain English text with occasional Swahili terms and Kenyan-specific entities (M-Pesa, Safaricom, KRA, Kenyan phone numbers).

### Training Procedure

#### Preprocessing

- Tokenizer: `distilbert-base-uncased` WordPiece tokenizer
- Max sequence length: 256 tokens
- Padding: `max_length` (padded to 256)
- Truncation: enabled

#### Architecture

- **Encoder**: DistilBERT (6 layers, 768 hidden, 12 heads), 66.4M parameters
- **Fusion**: Attention fusion with residual connections, 1.2M parameters
- **Task heads**: 8 linear heads across 3 tasks, 17K parameters
- **Total**: 67.6M parameters (258 MB on disk)

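These counts can be sanity-checked directly from the checkpoint. A small sketch using only the `model.pt` layout shown in the low-level usage section; the key prefix for the fusion block is not documented here, so only the encoder and head counts are computed:

```python
import torch

state_dict = torch.load("./model/model.pt", map_location="cpu", weights_only=True)

def count(prefix: str) -> int:
    return sum(t.numel() for key, t in state_dict.items() if key.startswith(prefix))

# "encoder." and "tasks." prefixes are confirmed by the low-level usage
# section above; remaining keys should account for the fusion block.
print(f"encoder: {count('encoder.') / 1e6:.1f}M parameters")  # expect ~66.4M
print(f"heads:   {count('tasks.') / 1e3:.0f}K parameters")    # expect ~17K
print(f"total:   {sum(t.numel() for t in state_dict.values()) / 1e6:.1f}M")  # ~67.6M
```
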
#### Training Hyperparameters

| Parameter | Value |
|:----------|:------|
| Learning rate | 2e-5 |
| Batch size | 16 |
| Epochs | 12 (best checkpoint at epoch 3) |
| Weight decay | 0.01 |
| Warmup steps | 20 |
| Max gradient norm | 1.0 |
| Optimizer | AdamW |
| Precision | FP32 |
| Task sampling | Proportional (temperature=2.0) |
| Early stopping patience | 5 epochs |
| Best model metric | eval_loss |

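The `temperature=2.0` sampling is worth unpacking: it rescales each task's sampling probability by the square root of its dataset size, which keeps the 4,996-sample call quality task from almost entirely crowding out the 20- and 15-sample tasks. A sketch of the arithmetic, assuming the common p_i proportional to n_i^(1/T) convention (the framework's exact implementation may differ):

```python
# Proportional task sampling with temperature T: p_i ~ n_i ** (1 / T).
# T=1 samples tasks by raw dataset size; larger T flattens toward uniform.
sizes = {"fraud_detection": 20, "sentiment": 15, "call_quality": 4996}
T = 2.0

weights = {task: n ** (1 / T) for task, n in sizes.items()}
total = sum(weights.values())
print({task: round(w / total, 3) for task, w in weights.items()})
# {'fraud_detection': 0.057, 'sentiment': 0.049, 'call_quality': 0.894}
# versus raw proportions of roughly 0.004 / 0.003 / 0.993
```
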
#### Task Loss Weights

| Head | Weight | Rationale |
|:-----|:-------|:----------|
| fraud | 1.0 | Standard |
| sentiment | 1.0 | Standard |
| opening | 1.0 | Standard |
| listening | 1.5 | Important quality dimension |
| proactiveness | 1.0 | Standard |
| resolution | 2.0 | Most critical quality dimension |
| hold | 0.5 | Less frequent in transcripts |
| closing | 1.0 | Standard |

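These weights enter training as plain multipliers on each head's loss before summation. The sketch below shows that combination under the loss functions listed in Technical Specifications; the reduction and batching details are illustrative, not the framework's exact code:

```python
import torch
import torch.nn.functional as F

HEAD_WEIGHTS = {"fraud": 1.0, "sentiment": 1.0, "opening": 1.0, "listening": 1.5,
                "proactiveness": 1.0, "resolution": 2.0, "hold": 0.5, "closing": 1.0}

def multi_task_loss(logits: dict, targets: dict) -> torch.Tensor:
    """Weighted sum of per-head losses: CrossEntropy for the fraud and
    sentiment heads, BCEWithLogits for the multi-label QA heads."""
    loss = torch.tensor(0.0)
    for head, head_logits in logits.items():
        if head in ("fraud", "sentiment"):
            head_loss = F.cross_entropy(head_logits, targets[head])
        else:
            head_loss = F.binary_cross_entropy_with_logits(head_logits, targets[head])
        loss = loss + HEAD_WEIGHTS[head] * head_loss
    return loss
```
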
#### Training Loss Progression

| Epoch | Train Loss | Eval Loss | Status |
|:------|:-----------|:----------|:-------|
| 3 | 1.878 | **1.948** | Best checkpoint |
| 7 | 1.471 | 2.057 | Overfitting begins |
| 8 | 1.403 | 2.068 | Continued overfitting |

The best checkpoint was selected at epoch 3 based on eval_loss. Training continued to epoch 12, but eval loss increased after epoch 3, indicating overfitting, as expected given the small fraud and sentiment datasets.

### Speeds, Sizes, Times

| Metric | Value |
|:-------|:------|
| Model size (disk) | 258 MB |
| Parameters | 67.6M |
| Inference latency (single task, CPU) | ~590 ms |
| Inference latency (all 3 tasks, CPU) | ~1,960 ms |
| Batch throughput (32 texts, single task, CPU) | ~647 ms/sample |
| Training time | ~5 minutes (CPU, 12 epochs) |

## Evaluation

### Metrics

All metrics are computed on the 20% held-out eval split.

**Fraud Detection** (binary classification):

| Metric | Value |
|:-------|:------|
| Accuracy | 1.000 |
| Precision | 1.000 |
| Recall | 1.000 |
| F1 | 1.000 |

**Sentiment Analysis** (3-class classification):

| Metric | Value |
|:-------|:------|
| Accuracy | 0.333 |
| Precision | 0.111 |
| Recall | 0.333 |
| F1 | 0.167 |

**Call Quality** (multi-label binary per head):

| Head | Precision | Recall | F1 |
|:-----|:----------|:-------|:---|
| Opening | 0.967 | 0.967 | **0.967** |
| Listening | 0.893 | 0.953 | **0.922** |
| Proactiveness | 0.746 | 0.868 | **0.802** |
| Resolution | 0.918 | 0.898 | **0.908** |
| Hold | 0.856 | 0.519 | **0.647** |
| Closing | 0.881 | 0.881 | **0.881** |

### Results Summary

- **Fraud detection** achieves perfect metrics on the eval set, but the eval set is tiny (4 samples). Production deployment requires evaluation on a larger, more diverse dataset.
- **Sentiment analysis** performs at the random baseline (33.3% accuracy for 3 classes), indicating the 15-sample dataset is insufficient. This head needs retraining with production data.
- **Call quality** shows strong performance across most heads (0.80-0.97 F1), with the "hold" category the weakest (0.647 F1) due to fewer hold-related examples in the training data.

## Model Examination

### Attention Fusion

The attention fusion mechanism learns task-specific attention patterns over the shared encoder output. This allows:

- the fraud head to attend to transaction-related tokens (amounts, account references)
- the sentiment head to attend to opinion-bearing words
- the QA heads to attend to conversational flow patterns

The fusion uses a gated residual connection (initialized at 0.5), meaning each task's representation is a learned blend of the task-specific attended output and the original encoder output.

### Security Features

When used with the JengaAI inference framework, the model supports:

- **PII Redaction**: masks Kenyan-specific PII (phone numbers, national IDs, KRA PINs, M-Pesa transaction IDs) before inference
- **Explainability**: token-level importance scores via attention analysis or gradient methods
- **Human-in-the-Loop**: automatic routing of low-confidence predictions to human reviewers based on entropy-based uncertainty estimation (illustrated below)
- **Audit Trail**: tamper-evident logging of every inference call with SHA-256 hash chains

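To illustrate the entropy-based routing named in the human-in-the-loop bullet above, a self-contained sketch of the underlying computation (the threshold and function name are hypothetical; the framework applies its own configured cutoff):

```python
import torch

def needs_human_review(probs: torch.Tensor, threshold: float = 0.8) -> bool:
    """Route a prediction to a human when its normalized entropy is high.

    probs: softmax output for one example, shape [num_classes].
    threshold: hypothetical example value, not the framework's default.
    """
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(probs.numel())))
    return (entropy / max_entropy).item() > threshold

print(needs_human_review(torch.tensor([0.969, 0.031])))  # False: confident call
print(needs_human_review(torch.tensor([0.55, 0.45])))    # True: near coin-flip
```
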
## Technical Specifications

### Model Architecture and Objective

- **Architecture**: DistilBERT encoder + attention fusion + multi-task heads
- **Encoder**: 6 transformer layers, 768 hidden size, 12 attention heads, 30,522 vocab
- **Fusion**: Single-head attention with residual gating
- **Objectives**: CrossEntropy (fraud, sentiment) + BCEWithLogits (call quality)

### Compute Infrastructure

#### Hardware

- Training: CPU (Intel/AMD, standard workstation)
- Inference: CPU or CUDA GPU

#### Software

- PyTorch 2.x
- Transformers 5.x
- JengaAI Framework V2
- Python 3.11+

## Environmental Impact

- **Hardware Type**: CPU (standard workstation)
- **Training Time**: ~5 minutes
- **Carbon Emitted**: Negligible (short training run on CPU)

## Citation

```bibtex
@software{jengaai2026,
  title = {JengaAI: Low-Code Multi-Task NLP for African Security Applications},
  author = {Rogendo},
  year = {2026},
  url = {https://huggingface.co/Rogendo/JengaAI-multi-task-nlp},
}
```

## Model Card Authors

Rogendo

## Model Card Contact

For questions, issues, or contributions: [GitHub Issues](https://github.com/Rogendo/JengaAI/issues)