Attention Fusion Multi-Task DistilBERT

A single DistilBERT-base-cased backbone trained simultaneously on three NLP tasks for child helpline conversation analysis, using per-task learned attention pooling heads.

| Task | Type | Heads |
|------|------|-------|
| Named Entity Recognition (NER) | Token-level | 10 entity labels |
| Case Classification (CLS) | Sentence-level | 4 heads (main category, sub-category, intervention, priority) |
| Quality Assurance Scoring (QA) | Sentence-level | 6 binary heads (17 sub-metrics) |

Validation Metrics (best checkpoint)

| Metric | Value |
|--------|-------|
| NER macro F1 | 0.5343 |
| CLS average accuracy (4 heads) | 0.6183 |
| QA average micro-F1 (6 heads) | 0.8386 |
| Composite average | 0.6638 |

Usage

Install dependencies

pip install torch transformers huggingface_hub

Download and run inference

from huggingface_hub import snapshot_download
import json, torch

# Download all files from the Hub (including inference.py)
model_dir = snapshot_download(repo_id="rogendo/attention-fusion-distilbert")

# inference.py ships in the downloaded snapshot; make it importable
import sys
sys.path.append(model_dir)
from inference import AttentionFusionInference

inf = AttentionFusionInference(model_dir=model_dir, device="cuda")  # or "cpu"

texts = [
    "Hello, I'm calling from Nairobi. My daughter Sarah, aged 12, was assaulted by her teacher.",
]

# Named Entity Recognition
ner_results = inf.predict_ner(texts)
# → [[("Sarah", "NAME"), ("12", "AGE"), ("Nairobi", "LOCATION"), ...]]

# Case Classification
cls_results = inf.predict_classification(texts)
# → [{"main_category": "VANE", "sub_category": "Physical Abuse",
#     "intervention": "Counselling", "priority": "1"}]

# Quality Assurance Scoring
qa_results = inf.predict_qa(texts)
# → [{"opening": [1], "listening": [1,0,1,1,0], ...}]

Architecture

Input Text → Tokenizer → DistilBERT-base-cased (shared backbone)
                                    │
              ┌─────────────────────┼──────────────────────┐
              │                     │                      │
           NER Head             CLS Head               QA Head
       Linear(768→768)      TaskAttnPooling        TaskAttnPooling
        + GELU + Drop           + Dropout              + Dropout
       Linear(768→10)       4 classifiers          6 binary heads
        [per token]         (main/sub/interv/       (open/listen/
       CrossEntropy          priority)               proact/resolv/
       ignore=-100          CrossEntropy×4           hold/close)
                            ignore=-1               BCEWithLogits×6

TaskAttentionPooling replaces the fixed [CLS] token with a learned per-task attention-weighted sum over all token positions, so each task can focus on the tokens most relevant to its objective.
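One way such a pooling head can be realized is sketched below, assuming each task owns a single scoring layer followed by a padding-masked softmax (class and argument names are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

class TaskAttentionPooling(nn.Module):
    """Per-task learned pooling: score every token, softmax over the
    sequence with padding masked out, and return the attention-weighted
    sum of token embeddings in place of the fixed [CLS] vector."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # one attention score per token

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        scores = self.score(hidden_states).squeeze(-1)          # (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, -1e9)  # ignore padding
        weights = torch.softmax(scores, dim=-1)                 # sums to 1 per example
        return torch.einsum("bs,bsh->bh", weights, hidden_states)  # (batch, hidden)
```

Because the scoring layer is instantiated once per task, the CLS and QA heads each learn their own attention distribution over the shared backbone's token embeddings.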

Training Strategy β€” Task Alternation

Each epoch, all batches from all three DataLoaders are collected, tagged with their task name, and randomly shuffled into a single sequence. The model then iterates through this shuffled list, giving the shared backbone a continuous gradient signal from all three tasks and preventing catastrophic forgetting.
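The shuffling step above can be sketched as follows (`build_epoch_schedule` and the loader names are illustrative assumptions; the repo's actual training loop may differ):

```python
import random

def build_epoch_schedule(task_loaders):
    """Flatten every (task, batch) pair from all DataLoaders into one
    randomly shuffled training sequence for the epoch."""
    schedule = [(task, batch)
                for task, loader in task_loaders.items()
                for batch in loader]
    random.shuffle(schedule)
    return schedule

# Per-step dispatch: each batch updates only its own task's loss, but
# every step sends gradients through the shared backbone, e.g.:
# for task, batch in build_epoch_schedule({"ner": ner_dl, "cls": cls_dl, "qa": qa_dl}):
#     loss = model(batch, task=task)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```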

NER Labels (10 classes)

O, NAME, LOCATION, VICTIM, AGE, GENDER, INCIDENT_TYPE, PERPETRATOR, PHONE_NUMBER, LANDMARK

Classification Labels

Main categories (8): Advice and Counselling, Child Maintenance & Custody, Disability, GBV, Information, Nutrition, Unknown, VANE

Interventions (16): multi-label combinations of Awareness/Information Provided, Counselling, Information Provided, Referral, and Signposting (e.g. Counselling; Counselling Referral; Counselling, Referral, Signposting)

Priority: 1 (critical) · 2 (urgent) · 3 (routine)

QA Scoring Heads

| Head | Sub-metrics | Criteria |
|------|-------------|----------|
| opening | 1 | Call opening phrase used |
| listening | 5 | Did not interrupt caller, Showed empathy, Paraphrased the issue, Used please/thank you, Did not hesitate |
| proactiveness | 3 | Offered to solve extra issues, Confirmed satisfaction, Followed up on updates |
| resolution | 5 | Gave accurate information, Correct language use, Consulted when unsure, Followed correct steps, Explained solution clearly |
| hold | 2 | Explained before placing on hold, Thanked caller for holding |
| closing | 1 | Proper closing phrase used |
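Since each head is trained with BCEWithLogitsLoss, turning raw head outputs into the 0/1 sub-metric lists shown in the usage example can be sketched as below (`qa_logits_to_scores` is an illustrative helper, not part of the repo; head names and counts are taken from the table above):

```python
import torch

# Head names and sub-metric counts from the QA table (1+5+3+5+2+1 = 17)
QA_HEADS = {"opening": 1, "listening": 5, "proactiveness": 3,
            "resolution": 5, "hold": 2, "closing": 1}

def qa_logits_to_scores(head_logits, threshold=0.5):
    """Sigmoid + threshold each BCEWithLogits head into 0/1 sub-metric scores."""
    return {head: (torch.sigmoid(logits) >= threshold).int().tolist()
            for head, logits in head_logits.items()}
```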

Citation

If you use this model, please cite the training repository:

@misc{attention-fusion-distilbert,
  title  = {Attention Fusion Multi-Task DistilBERT for Child Helpline Analysis},
  year   = {2025},
  url    = {https://huggingface.co/rogendo/attention-fusion-distilbert}
}