# MentalBERT V5 Cleanlab: Cooperative Source-Aware (Label-Noise Cleaned)

## Overview

8-class mental health text classifier trained on the V5 dataset after **Confident Learning** (cleanlab) label-noise removal on the Depression/Suicidal boundary.

**Architecture:** Cooperative source-aware MentalBERT

**Base model:** `mental/mental-bert-base-uncased`

**Classes:** ['Anxiety', 'Bipolar', 'Depression', 'Directed Aggression', 'Normal', 'Personality Disorder', 'Stress', 'Suicidal']

## Cleaning Methodology

- **Tool:** cleanlab 2.6.x with `filter_by='prune_by_noise_rate'`
- **OOF probabilities:** 3-fold stratified cross-validation (2 epochs/fold)
- **Surgical scope:** Depression and Suicidal rows only, non-CSSRS sources only
- **CSSRS excluded:** CSSRS is the clinician-annotated trust anchor (κ=0.79); never cleaned
- **Total drops:** 4,745 samples removed from train+val pool

## Performance

| Metric | Value |
|---|---|
| Test Accuracy | 83.14% |
| F1 Macro | 0.8384 |
| F1 Weighted | 0.8313 |
| Dep→Sui bleed | 692 |
| Sui→Dep bleed | 575 |
| Total bleed | 1,267 |

## Inference Usage

```python
import json

import joblib
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import BertModel, BertTokenizerFast

REPO = 'itsLu/mentalbert-v5-cleanlab'

# Encoder and tokenizer are loaded directly from the model repo.
bert = BertModel.from_pretrained(REPO)
tokenizer = BertTokenizerFast.from_pretrained(REPO)
config = json.load(open(hf_hub_download(REPO, 'inference_config.json')))

# Classification head: dropout + linear layer over the 768-dim pooled output.
cls_head = nn.Sequential(nn.Dropout(0.1), nn.Linear(768, config['n_classes']))
cls_head.load_state_dict(torch.load(hf_hub_download(REPO, 'cls_head.pt'), map_location='cpu'))

# Label encoder maps class indices back to class names.
le_path = hf_hub_download(REPO, 'label_encoder.joblib')
le = joblib.load(le_path)


def predict(text, device='cpu'):
    bert.to(device).eval()
    cls_head.to(device).eval()
    enc = tokenizer(text, max_length=128, padding='max_length',
                    truncation=True, return_tensors='pt')
    with torch.no_grad():
        pooled = bert(enc['input_ids'].to(device),
                      enc['attention_mask'].to(device)).pooler_output
        logits = cls_head(pooled)
    probs = torch.softmax(logits, dim=-1).squeeze()
    pred_id = probs.argmax().item()
    # Returns (predicted class name, probability of that class).
    return le.classes_[pred_id], probs[pred_id].item()
```

## Leaderboard Context

| Run | Acc | F1 Macro | Bleed |
|---|---|---|---|
| V3 Two-Branch v1 | 86.82% | 0.8469 | 726 |
| V5 Flat baseline | 82.41% | 0.8308 | 1,265 |
| V5 Cooperative source-aware | 83.23% | 0.8381 | 1,259 |
| V5 DANN (ablation) | 81.50% | 0.8245 | 1,373 |
| **V5 Cleanlab (this)** | **83.14%** | **0.8384** | **1,267** |

## Project

Helwan University, Faculty of Engineering, 2025–2026
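
The Dep→Sui, Sui→Dep, and total bleed figures in the Performance table are counts of Depression/Suicidal cross-confusions. The sketch below shows one way such counts could be computed from paired true and predicted labels. It assumes that "Dep→Sui" means a sample whose true label is Depression and whose predicted label is Suicidal (and the reverse for "Sui→Dep"); this card does not state that direction explicitly. The helper name `bleed_counts` and the toy labels are hypothetical and are not part of the released code or the V5 test set.

```python
from collections import Counter

DEP, SUI = "Depression", "Suicidal"


def bleed_counts(y_true, y_pred):
    """Count Depression/Suicidal cross-confusions in paired label lists.

    Assumption: direction is read as true label -> predicted label.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = Counter(zip(y_true, y_pred))
    dep_to_sui = pairs[(DEP, SUI)]  # true Depression, predicted Suicidal
    sui_to_dep = pairs[(SUI, DEP)]  # true Suicidal, predicted Depression
    return {
        "dep_to_sui": dep_to_sui,
        "sui_to_dep": sui_to_dep,
        "total": dep_to_sui + sui_to_dep,
    }


# Toy labels for illustration only (not drawn from the V5 test set).
y_true = ["Depression", "Depression", "Suicidal", "Suicidal", "Anxiety"]
y_pred = ["Suicidal", "Depression", "Depression", "Suicidal", "Anxiety"]
result = bleed_counts(y_true, y_pred)
print(result)
```

Under this reading, the total is the sum of the two directional counts, which matches the table: 692 + 575 = 1,267.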