# DeBERTa-v3-large + Psycholinguistic Features (5-Fold Ensemble)
> Task: 8-class emotion classification on sentence-level mental health discourse
> Backbone: microsoft/deberta-v3-large + 64 engineered features
> Validation: 5-Fold Stratified CV Macro F1 = 0.6326 ± 0.0115
> Compute: Trained on Kaggle Tesla P100 (16GB) with FP16
## Model Summary
This model is a hybrid emotion classifier designed to detect granular emotional states in mental health discussions. It combines the contextual understanding of DeBERTa-v3-large with a dense vector of 64 engineered psycholinguistic features (sentiment, readability, keyword indicators, and linguistic style).
## Key Characteristics
- Not a single model: This is a 5-Fold Ensemble
- Why ensemble? Voting/averaging across folds reduces variance, improves stability, and generalizes better on unseen mental health text.
- Two-input system: The model consumes (a) tokenized text and (b) a 64-dim feature vector.
## What it predicts
The model outputs one of 8 emotion categories:
| ID | Emotion |
|---|---|
| 0 | sadness |
| 1 | hopelessness |
| 2 | loneliness |
| 3 | anger |
| 4 | worthlessness |
| 5 | suicide intent |
| 6 | emptiness |
| 7 | brain dysfunction |
## Architecture Overview
| Component | Specification |
|---|---|
| Transformer backbone | microsoft/deberta-v3-large |
| Max sequence length | 512 |
| Auxiliary features | 64 engineered features (see below) |
| Fusion method | Concatenate [CLS] embedding + projected feature vector |
| Head | MLP classifier → 8 logits |
| Training strategy | 5-Fold Stratified Cross Validation + fold ensembling |
| Domain | Reddit mental health discourse (2015–2023) |
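The fusion step described above can be sketched as follows. This is a minimal illustration, not the actual `CustomDebertaModel` shipped in `model_utils.py`: the class name `HybridFusionHead`, the feature projection width (128), and the MLP layout are assumptions, and the random `cls_embedding` stands in for the real DeBERTa `[CLS]` output.

```python
import torch
import torch.nn as nn

class HybridFusionHead(nn.Module):
    """Fuses a transformer [CLS] embedding with a 64-dim feature vector."""
    def __init__(self, hidden_size=1024, n_features=64, proj_dim=128, n_classes=8):
        super().__init__()
        # Project raw psycholinguistic features into a learned subspace
        self.feature_proj = nn.Sequential(
            nn.Linear(n_features, proj_dim),
            nn.ReLU(),
        )
        # MLP classifier over the concatenated representation
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + proj_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def forward(self, cls_embedding, feature_vector):
        fused = torch.cat([cls_embedding, self.feature_proj(feature_vector)], dim=-1)
        return self.classifier(fused)

# In the real model, cls_embedding would come from
# AutoModel.from_pretrained("microsoft/deberta-v3-large"); here it is random.
head = HybridFusionHead()
cls_embedding = torch.randn(2, 1024)   # [batch, hidden]
feature_vector = torch.randn(2, 64)    # [batch, 64]
logits = head(cls_embedding, feature_vector)
print(logits.shape)  # torch.Size([2, 8])
```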
## Engineered Features (64 total)
Extracted from `full_text = title + " " + sentence`:
- Basic (16): char count, word count, sentence count, punctuation counts/ratios, uppercase ratio, titlecase count, readability indices.
- Sentiment (10): VADER compound/pos/neg/neu, intensity, sentiment flags, sentiment variance, positive keyword counts.
- Emotion keyword signals (26): per-class keyword count/present/ratio + total keyword count + diversity score.
- Linguistic (12): pronoun counts, negation counts, stopword ratio, unique word ratio, repetition, avg sentence length, ellipsis, ALLCAPS words.
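A handful of the basic and linguistic features above can be sketched in plain Python. This is illustrative only: the real pipeline also uses VADER (nltk) and textstat, and the feature names below are not necessarily the ones in `feature_columns.json`.

```python
import string

def extract_basic_features(text: str) -> dict:
    """Illustrative subset of the 64 engineered features."""
    words = text.split()
    n_chars = len(text)
    n_words = len(words)
    return {
        "char_count": n_chars,
        "word_count": n_words,
        "punct_count": sum(c in string.punctuation for c in text),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(n_chars, 1),
        "allcaps_words": sum(w.isupper() and len(w) > 1 for w in words),
        "unique_word_ratio": len({w.lower() for w in words}) / max(n_words, 1),
        "ellipsis_count": text.count("..."),
        "negation_count": sum(w.lower() in {"no", "not", "never", "nothing"} for w in words),
    }

feats = extract_basic_features("I can't do this anymore... NOTHING helps.")
print(feats["ellipsis_count"], feats["allcaps_words"])  # 1 1
```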
Feature extraction results from the training run:
- Train features: `(22820, 64)`
- Competition validation features: `(4611, 64)`
- Saved artifacts: `train_with_features.csv`, `val_with_features.csv`, `feature_columns.json`
**IMPORTANT:** The model expects feature vectors in the exact same column order as `feature_columns.json`.
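Because predictions silently degrade if the column order drifts, an explicit alignment step is worth having. A minimal sketch, with a hypothetical `align_features` helper and toy columns standing in for the 64 real ones:

```python
import numpy as np

def align_features(feat: dict, feature_cols: list) -> np.ndarray:
    """Reorder a feature dict into the exact training-time column order."""
    missing = [c for c in feature_cols if c not in feat]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return np.array([feat[c] for c in feature_cols], dtype=np.float32)

cols = ["char_count", "word_count"]  # toy stand-in for feature_columns.json
vec = align_features({"word_count": 7, "char_count": 42}, cols)
print(vec)  # [42.  7.]
```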
## Dataset & Training Data
Dataset summary
- Total: 32,347 sentences from 5,154 Reddit posts (2015–2023)
- Train: 22,820 labeled sentences
- Competition validation file: 4,611 rows (unlabeled for leaderboard; used for submission generation)
- Average engagement: ~167 upvotes/post (dataset info)
Training label distribution (from run)
- sadness (0): 6956 (30.5%)
- hopelessness (1): 4571 (20.0%)
- loneliness (2): 3668 (16.1%)
- anger (3): 3025 (13.3%)
- worthlessness (4): 1586 (7.0%)
- suicide intent (5): 1297 (5.7%)
- emptiness (6): 1164 (5.1%)
- brain dysfunction (7): 553 (2.4%)
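Given the skew above (sadness is roughly 12x more frequent than brain dysfunction), inverse-frequency class weighting is one common mitigation. Whether the training run actually used weighting is not stated, so treat this as a sketch only:

```python
import numpy as np

# Label counts from the training distribution above (classes 0-7)
counts = np.array([6956, 4571, 3668, 3025, 1586, 1297, 1164, 553])

# Inverse-frequency weights, normalized so they average to 1.0;
# rare classes (e.g. brain dysfunction) get proportionally larger weights.
weights = counts.sum() / (len(counts) * counts)
print(weights.round(2))
```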
Preprocessing
- Constructed `full_text` by concatenating `title` + `sentence`
- Extracted 64 engineered features from `full_text`
- NaNs handled in features
- Tokenization with truncation to max length 512
## Evaluation Results
Primary evaluation metric
- Macro F1 (balanced across classes), computed via 5-Fold Stratified CV
Results (from training output)
- Average Macro F1: 0.6326
- Standard deviation: 0.0115
- Fold scores: 0.6286, 0.6378, 0.6163, 0.6510, 0.6291
Notes on stability
Low std dev indicates the model is stable across folds and not overly dependent on a single split.
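The reported mean and standard deviation can be reproduced directly from the fold scores:

```python
import numpy as np

fold_scores = np.array([0.6286, 0.6378, 0.6163, 0.6510, 0.6291])
print(f"Mean Macro F1: {fold_scores.mean():.4f}")  # 0.6326
print(f"Std deviation: {fold_scores.std():.4f}")   # 0.0115
```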
## Usage
Inputs & Outputs
Inputs
- `title: str` (can be empty)
- `sentence: str`
The model consumes:
- tokenized text tensors: `input_ids`, `attention_mask`
- numeric tensor: `feature_vector` of shape `[batch, 64]`
Output
- `logits` of shape `[batch, 8]`
- apply softmax → probabilities
- final label is `argmax(probs)`
## Inference (Kaggle-friendly)
Because this model takes two inputs, you cannot directly use `AutoModelForSequenceClassification` alone. You must use the custom model wrapper that fuses transformer embeddings with the 64-dim feature vector.
Minimal inference snippet (single fold)
```python
import os
import json
import numpy as np
import torch
from transformers import AutoTokenizer

# You must provide these from your package/notebook:
# - CustomDebertaModel (hybrid model)
# - extract_features(text) -> np.array shape (64,) or torch.Tensor shape (1, 64)
from model_utils import CustomDebertaModel, extract_features

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_DIR = "/kaggle/input/your-model-dataset/model_fold_0"  # fold folder
WEIGHTS_PATH = os.path.join(MODEL_DIR, "model.safetensors")
FEATURE_COLUMNS_PATH = "/kaggle/input/your-model-dataset/feature_columns.json"

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

# Load feature column order (VERY IMPORTANT)
with open(FEATURE_COLUMNS_PATH, "r") as f:
    feature_cols = json.load(f)

model = CustomDebertaModel(n_features=64, n_classes=8)

# The weights are stored as safetensors, so load them with the safetensors API:
from safetensors.torch import load_file
state_dict = load_file(WEIGHTS_PATH)
model.load_state_dict(state_dict, strict=True)
model.to(DEVICE).eval()

text = "I feel like I'm drowning in my own thoughts."

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=False,
)
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Extract features (must match training feature order)
feat = extract_features(text)  # returns a raw dict OR an array

# If extract_features returns a dict, reorder it using feature_cols:
if isinstance(feat, dict):
    feat_vec = np.array([feat[c] for c in feature_cols], dtype=np.float32)[None, :]
else:
    feat_vec = np.array(feat, dtype=np.float32)[None, :]
feature_vector = torch.tensor(feat_vec).to(DEVICE)

with torch.no_grad():
    logits = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        feature_vector=feature_vector,
    )

probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]
pred_id = int(probs.argmax())
print("probs:", probs)
print("pred_id:", pred_id)
```
## Ensemble Inference (Recommended)
For best performance, predictions from all 5 trained folds should be combined by averaging class probabilities. This reduces variance and improves robustness on unseen mental health text.
Each fold was trained independently using stratified splits. During inference, each fold produces a probability distribution over the 8 emotion classes, and the final prediction is computed as:
> Final Probability = Mean(Fold₁ … Fold₅ Probabilities)
## Ensemble Inference Code
```python
import os
import json
import numpy as np
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

# Custom hybrid model + feature extractor
from model_utils import CustomDebertaModel, extract_features

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Paths (adjust to your Kaggle dataset structure)
MODEL_ROOT = "/kaggle/input/fragment-of-feeling-model"
FEATURE_COLUMNS_PATH = os.path.join(MODEL_ROOT, "feature_columns.json")
FOLD_DIRS = [
    "model_fold_0",
    "model_fold_1",
    "model_fold_2",
    "model_fold_3",
    "model_fold_4",
]

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

# Load feature column order
with open(FEATURE_COLUMNS_PATH, "r") as f:
    feature_columns = json.load(f)

# Load all fold models
models = []
for fold in FOLD_DIRS:
    model = CustomDebertaModel(n_features=64, n_classes=8)
    weights = load_file(os.path.join(MODEL_ROOT, fold, "model.safetensors"))
    model.load_state_dict(weights, strict=True)
    model.to(DEVICE).eval()
    models.append(model)

def ensemble_predict(text: str):
    # Tokenize
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    # Feature extraction (must match training feature order)
    raw_features = extract_features(text)
    if isinstance(raw_features, dict):
        feature_vector = np.array(
            [raw_features[col] for col in feature_columns],
            dtype=np.float32,
        )[None, :]
    else:
        feature_vector = np.array(raw_features, dtype=np.float32)[None, :]
    feature_vector = torch.tensor(feature_vector).to(DEVICE)

    # Average probabilities across folds
    probs_sum = None
    with torch.no_grad():
        for model in models:
            logits = model(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                feature_vector=feature_vector,
            )
            probs = torch.softmax(logits, dim=-1)
            probs_sum = probs if probs_sum is None else probs_sum + probs

    probs_avg = (probs_sum / len(models)).cpu().numpy()[0]
    pred_class = int(probs_avg.argmax())
    return pred_class, probs_avg

# Example
pred, probs = ensemble_predict("Nothing feels real anymore.")
print("Predicted class:", pred)
print("Probabilities:", probs)
```
## Input & Output Specification
Output
| Name | Shape | Type | Description |
|---|---|---|---|
| logits | [batch, 8] | float32 | Raw, unnormalized scores for each emotion class |
| probabilities | [batch, 8] | float32 | Softmax-normalized class probabilities |
The final predicted emotion is obtained via `argmax(probabilities)`.
## System & Integration Details
Standalone or System Component
This model can operate as a standalone text classifier, but is designed to function as part of a hybrid NLP system when combined with its preprocessing pipeline (feature extraction + tokenization).
Upstream Requirements
Before inference, the following steps must be completed:
- Concatenate `title` and `sentence` into a single `full_text` field
- Extract the 64-dimensional psycholinguistic feature vector from `full_text`
- Ensure feature ordering exactly matches `feature_columns.json`
- Tokenize text with the `microsoft/deberta-v3-large` tokenizer (max length = 512)
Failure to follow these steps will result in invalid predictions.
Downstream Dependencies
- Model outputs are probabilistic signals, not deterministic labels
- Recommended downstream handling:
- Confidence thresholding
- Abstain / fallback logic for low-confidence predictions
- Human-in-the-loop review for sensitive cases
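The confidence-thresholding and abstain logic above can be sketched as follows; the 0.5 threshold is an illustrative choice, not a calibrated value from the training run.

```python
import numpy as np

EMOTIONS = ["sadness", "hopelessness", "loneliness", "anger",
            "worthlessness", "suicide intent", "emptiness", "brain dysfunction"]

def decide(probs: np.ndarray, threshold: float = 0.5):
    """Return a label only when the top probability clears the threshold."""
    pred = int(probs.argmax())
    if probs[pred] < threshold:
        return "ABSTAIN"  # route to fallback logic / human review
    return EMOTIONS[pred]

print(decide(np.array([0.70, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03])))  # sadness
print(decide(np.array([0.20, 0.15, 0.15, 0.10, 0.10, 0.10, 0.10, 0.10])))  # ABSTAIN
```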
## Implementation Requirements
Training Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA Tesla P100 (16GB) |
| Environment | Kaggle Notebook |
| Precision | FP16 (Mixed Precision) |
| Training Strategy | 5-Fold Stratified Cross Validation |
Inference Hardware
| Component | Specification |
|---|---|
| GPU (Recommended) | NVIDIA T4 / P100 |
| CPU | Supported (higher latency) |
## Software Stack
The following software versions were used during training and inference:
- Python β₯ 3.10
- torch β₯ 2.0.0
- transformers β₯ 4.30.0
- nltk (VADER sentiment analysis)
- textstat (readability metrics)
- safetensors (secure weight loading)
## Model Characteristics
Initialization
- Initialized from a pretrained language model
- Backbone: `microsoft/deberta-v3-large`
- Classification head randomly initialized and fine-tuned on task labels
Model Statistics
| Attribute | Value | Notes |
|---|---|---|
| Total Parameters | ~435M | Backbone + fusion head |
| Transformer Layers | 24 | DeBERTa-Large |
| Hidden Size | 1024 | CLS embedding |
| Precision | FP16 | Mixed precision training |
| Pruning | None | Full model retained |
| Quantization | None | No post-training compression |
## Data Overview
Training Data
| Attribute | Description |
|---|---|
| Source | Reddit mental health communities |
| Communities | r/depression, r/anxiety, r/suicidewatch |
| Total Posts | 5,154 |
| Total Sentences | 32,347 |
| Labeled Training Set | 22,820 sentences |
| Time Range | 2015–2023 |
Preprocessing Steps
- Sentence-level segmentation
- Removal of usernames, URLs, and identifiable metadata
- Construction of `full_text = title + " " + sentence`
- Feature extraction + NaN handling
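The username/URL removal step might look like the following; the exact patterns used during dataset construction are not published, so the `anonymize` helper below is an approximation.

```python
import re

def anonymize(text: str) -> str:
    """Strip Reddit usernames and URLs, approximating the stated preprocessing."""
    text = re.sub(r"https?://\S+", "", text)        # URLs
    text = re.sub(r"/?u/[A-Za-z0-9_-]+", "", text)  # u/username mentions
    return re.sub(r"\s+", " ", text).strip()        # collapse leftover whitespace

print(anonymize("thanks u/someone, see https://example.com for more"))
```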
Demographic Information
- No explicit demographic labels provided
- User population is anonymous
- Likely skew towards ages 18–34 based on Reddit usage patterns
## Evaluation Results
Evaluation Protocol
- Metric: Macro F1 Score
- Validation Strategy: 5-Fold Stratified Cross Validation
- Objective: Balanced performance across all emotion classes
Results Summary
| Metric | Value |
|---|---|
| Macro F1 (mean) | 0.6326 |
| Standard Deviation | 0.0115 |
| Fold Scores | 0.6286, 0.6378, 0.6163, 0.6510, 0.6291 |
Low variance across folds indicates stable generalization.
## Subgroup Analysis & Observed Biases
Identified Behaviors
- Keyword Sensitivity: explicit distress terms (e.g., "suicide", "kill myself") strongly influence predictions.
- Length Bias: best performance on inputs between 10–30 words; higher variance observed for very short inputs (<5 words).
- Domain Dependence: the model is optimized for Reddit-style, informal English text.
Mitigation Strategies
- Ensemble averaging across folds
- Probability-based confidence thresholds
- Optional human review for high-risk outputs
## Usage Limitations
THIS MODEL IS NOT A MEDICAL DEVICE
- Not clinically validated
- Not suitable for autonomous mental health diagnosis
- Not safe for automated suicide-risk intervention systems
- Intended strictly for research, benchmarking, and educational use
## Ethics & Responsible Use
- All training data is anonymized
- No personally identifiable information is stored or inferred
- Outputs reflect language patterns, not mental states
- High-risk applications must include human oversight
## Recommended Model Package Structure
```
fragment-of-feeling-model/
├── model_fold_0/
│   └── model.safetensors
├── model_fold_1/
│   └── model.safetensors
├── model_fold_2/
│   └── model.safetensors
├── model_fold_3/
│   └── model.safetensors
├── model_fold_4/
│   └── model.safetensors
├── feature_columns.json
├── model_utils.py
└── README.md
```
## Reported Training Output
- Cross-validation completed successfully
- Macro F1 (5-Fold CV): 0.6326
- Standard deviation: 0.0115
- Submission file generated without errors
## Intended Use Statement
This model is intended for research and educational purposes to study emotional signal detection in anonymized mental health text.
Any real-world deployment must include robust safeguards, calibration, and human-in-the-loop review.
Related code can be found at https://www.kaggle.com/models/sanjidh090/fragment_final_model/code