Early Depression Detection using Longformer and Data Augmentation
This is a fine-tuned version of AIMH/mental-longformer-base-4096 for detecting linguistic markers of depression risk based on a user's entire posting history. This model is the primary artifact of the research project, "Early Depression Detection and Correlational Analysis on eRisk by Longformer and Data Augmentation."
Project Summary
This model was developed as part of a Master's research project to address the challenges of early depression detection from noisy and imbalanced social media data. The methodology involved:
- Fine-tuning a domain-specific Mental-Longformer model, chosen for its ability to handle long user histories (up to 4096 tokens).
- Implementing an advanced data augmentation strategy using Gemini 2.5 Flash Lite to mitigate severe class imbalance.
- Conducting a comprehensive correlational analysis to uncover behavioral, social, and linguistic patterns of depression online.
On the final held-out eRisk 2025 test set, this model achieved an F1-score of 0.77 for the depressed class, demonstrating robust generalization.
Training Procedure
Base Model
This model was fine-tuned from AIMH/mental-longformer-base-4096, a Longformer model pre-trained on a large corpus of text from online mental health forums, making it highly specialized for this domain.
Training Data
The model was fine-tuned on user-level data from the eRisk dataset (CLEF 2017, 2018, and 2022). Due to the sensitive nature and licensing of this data, it cannot be redistributed. Please refer to the official CLEF eRisk workshops for information on data access.
Data Augmentation Strategy
To address the critical challenges of data scarcity and class imbalance, a multi-pronged data augmentation strategy was employed for the depressed (minority) class, powered by Gemini 2.5 Flash Lite:
- Translation: Non-English posts from depressed users were translated into English to increase data volume.
- Paraphrasing: Gemini was prompted to generate multiple, contextually relevant rephrased versions of existing depressed posts, increasing linguistic diversity.
- Quality Control: Augmented samples were rigorously filtered based on semantic similarity and sentiment consistency to ensure high fidelity and prevent the introduction of noise.
This augmentation strategy proved highly effective, enabling the Longformer model to learn more robust patterns from an expanded minority class.
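The exact similarity model, sentiment classifier, and thresholds used for quality control are not specified in this card. The sketch below illustrates the filtering idea only, with a bag-of-words cosine similarity and a toy sentiment lexicon standing in for the real embedding-based and sentiment checks; the threshold of 0.5 is likewise an illustrative assumption.

```python
import math
import re
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for embedding-based semantic similarity)."""
    va, vb = Counter(re.findall(r"\w+", a.lower())), Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

NEGATIVE_WORDS = {"down", "tired", "sad", "empty", "hopeless", "exhausted"}  # toy lexicon

def sentiment_sign(text: str) -> int:
    """Crude polarity: -1 if any negative-lexicon word appears, else 1 (stand-in for a real classifier)."""
    return -1 if NEGATIVE_WORDS & set(re.findall(r"\w+", text.lower())) else 1

def keep_augmented(original: str, augmented: str, sim_threshold: float = 0.5) -> bool:
    """Keep an augmented sample only if it stays semantically close and sentiment-consistent."""
    return (cosine_similarity(original, augmented) >= sim_threshold
            and sentiment_sign(original) == sentiment_sign(augmented))

original = "I feel so tired and empty lately"
paraphrase = "Lately I feel completely empty and tired"   # passes both checks
off_topic = "Had a great day at the beach with friends"   # rejected: dissimilar, positive
print(keep_augmented(original, paraphrase), keep_augmented(original, off_topic))
```

In the actual pipeline, the bag-of-words similarity would be replaced by sentence-embedding similarity and the lexicon by a trained sentiment model, but the accept/reject logic has the same two-gate shape.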
Performance
The model's performance was evaluated in two stages: through 5-fold cross-validation during training, and on a final, held-out test set (eRisk 2025).
Final Test Set Performance (eRisk 2025)
This is the primary result, showing the performance of the single best model on completely unseen data.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| non-depressed (0) | 0.9658 | 0.9789 | 0.9723 | 807 |
| depressed (1) | 0.8132 | 0.7255 | 0.7668 | 102 |
| Accuracy | | | 0.9505 | 909 |
| Weighted Avg | 0.9486 | 0.9505 | 0.9493 | 909 |
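The per-class figures in the table are internally consistent: each F1-score is the harmonic mean of its precision and recall, and the weighted average combines the per-class F1-scores by support. A quick check:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures from the eRisk 2025 test table above
p_dep, r_dep, support_dep = 0.8132, 0.7255, 102
p_non, r_non, support_non = 0.9658, 0.9789, 807

f1_dep = f1(p_dep, r_dep)  # ~0.7668 for the depressed class
f1_non = f1(p_non, r_non)  # ~0.9723 for the non-depressed class

total = support_dep + support_non
weighted_f1 = (f1_dep * support_dep + f1_non * support_non) / total  # ~0.9493

print(f1_dep, f1_non, weighted_f1)
```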
Training & Validation Stability (5-Fold Cross-Validation)
To ensure the model is robust, it was trained using 5-fold cross-validation on the combined 2017-2022 eRisk datasets. The average performance across the 5 validation folds demonstrates the model's stability.
- Mean F1-Score across 5 Folds: 0.8623
- Standard Deviation of F1-Score: 0.0093
The low standard deviation indicates that the model performs consistently across different subsets of the training data. The model uploaded here is the best-performing single model from Fold 1 of this process.
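The card reports only the aggregate mean (0.8623) and standard deviation (0.0093), not the individual fold scores. For clarity, this is how those statistics are computed from per-fold validation F1-scores; the fold values below are hypothetical placeholders, not the actual results.

```python
import statistics

# Hypothetical per-fold validation F1-scores (the real fold values are not published)
fold_f1 = [0.871, 0.858, 0.855, 0.866, 0.861]

mean_f1 = statistics.mean(fold_f1)
std_f1 = statistics.stdev(fold_f1)  # sample standard deviation across folds

print(f"Mean F1: {mean_f1:.4f}, Std: {std_f1:.4f}")
```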
How to Use
You can use this model with a text-classification pipeline.
```python
from transformers import pipeline

# Load the model from the Hub
pipe = pipeline("text-classification", model="avtak/erisk-longformer-depression-v1")

# The model works best on longer texts that represent a collection of posts
user_posts = """
I've been feeling really down lately. Nothing seems fun anymore...
I tried playing my favorite game but I just couldn't get into it.
Sleep is my only escape but I wake up feeling just as tired.
"""

result = pipe(user_posts)
print(result)
# [{'label': 'LABEL_1', 'score': 0.85}] -> Example output where LABEL_1 is the "depressed" class
```
Ethical Considerations and Limitations
- Not a Diagnostic Tool: This model is NOT a medical diagnostic tool and should not be used as such. It only identifies statistical patterns in language that are correlated with a depression label in a specific dataset. Please consult a qualified healthcare professional for any mental health concerns.
- High Risk of Misuse: Using this model to automatically label or judge individuals online is a misuse of the technology. It should only be used for research purposes under ethical guidelines.
- Bias in Data: The training data is from Reddit, a platform with a specific demographic user base. The model may not generalize well to other platforms, cultures, or demographic groups. The linguistic expression of mental distress varies greatly.
- Correlation, not Causation: The model identifies linguistic patterns correlated with depression, not the causes of depression.
Author and Contact
This model was developed by Hassan Hassanzadeh Aliabadi as part of a Master in Data Science degree at Universiti Malaya.
- LinkedIn: https://www.linkedin.com/in/hassanzh/
- Hugging Face: https://huggingface.co/avtak
- Google Scholar: https://scholar.google.com/citations?hl=en&user=7sU9U1QAAAAJ
For questions about this model, please open a discussion on the Hugging Face community tab.
Citation
If you use this model in your research, please consider citing it:
```bibtex
@misc{hassanzadeh_aliabadi_erisk_2025,
  author       = {Hassan Hassanzadeh Aliabadi},
  title        = {Early Depression Detection and Correlational Analysis on eRisk by Longformer and Data Augmentation},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/avtak/erisk-longformer-depression-v1}}
}
```