
Early Depression Detection using Longformer and Data Augmentation

This is a fine-tuned version of AIMH/mental-longformer-base-4096 for detecting linguistic markers of depression risk based on a user's entire posting history. This model is the primary artifact of the research project, "Early Depression Detection and Correlational Analysis on eRisk by Longformer and Data Augmentation."

Project Summary

This model was developed as part of a Master's research project to address the challenges of early depression detection from noisy and imbalanced social media data. The methodology involved:

  1. Fine-tuning a domain-specific Mental-Longformer model, chosen for its ability to handle long user histories (up to 4096 tokens).
  2. Implementing an advanced data augmentation strategy using Gemini 2.5 Flash Lite to mitigate severe class imbalance.
  3. Conducting a comprehensive correlational analysis to uncover behavioral, social, and linguistic patterns of depression online.

On the final held-out eRisk 2025 test set, this model achieved an F1-score of 0.77 for the depressed class, demonstrating robust generalization.

Training Procedure

Base Model

This model was fine-tuned from AIMH/mental-longformer-base-4096, a Longformer model pre-trained on a large corpus of text from online mental health forums, making it highly specialized for this domain.

Training Data

The model was fine-tuned on user-level data from the eRisk dataset (CLEF 2017, 2018, and 2022). Due to the sensitive nature and licensing of this data, it cannot be redistributed. Please refer to the official CLEF eRisk workshops for information on data access.

Data Augmentation Strategy

To address the critical challenges of data scarcity and class imbalance, a multi-pronged data augmentation strategy was employed for the depressed (minority) class, powered by Gemini 2.5 Flash Lite:

  • Translation: Non-English posts from depressed users were translated into English to increase data volume.
  • Paraphrasing: Gemini was prompted to generate multiple contextually relevant paraphrases of existing posts from depressed users, increasing linguistic diversity.
  • Quality Control: Augmented samples were rigorously filtered based on semantic similarity and sentiment consistency to ensure high fidelity and prevent the introduction of noise.

This augmentation strategy proved highly effective, enabling the Longformer model to learn more robust patterns from an expanded minority class.
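The quality-control step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the real filter presumably used embedding-based semantic similarity and a sentiment model, whereas this toy version substitutes a lexical cosine similarity and a tunable threshold (both are assumptions):

```python
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity, standing in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_augmented(original: str, candidates: list[str],
                     sim_threshold: float = 0.5) -> list[str]:
    """Keep only generated paraphrases that stay close to the source post."""
    return [c for c in candidates if cosine_sim(original, c) >= sim_threshold]
```

A paraphrase that drifts too far from the source post falls below the threshold and is discarded, which is how the filter prevents augmentation from injecting noise into the minority class.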

Performance

The model's performance was evaluated in two stages: through 5-fold cross-validation during training, and on a final, held-out test set (eRisk 2025).

Final Test Set Performance (eRisk 2025)

This is the primary result, showing the performance of the single best model on completely unseen data.

Class               Precision   Recall   F1-Score   Support
non-depressed (0)      0.9658   0.9789     0.9723       807
depressed (1)          0.8132   0.7255     0.7668       102
Accuracy                                   0.9505       909
Weighted Avg           0.9486   0.9505     0.9493       909
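Since the F1-score is the harmonic mean of precision and recall, the per-class figures in the table can be cross-checked directly:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class figures from the table above
assert abs(f1(0.8132, 0.7255) - 0.7668) < 1e-4  # depressed (1)
assert abs(f1(0.9658, 0.9789) - 0.9723) < 1e-4  # non-depressed (0)
```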

Training & Validation Stability (5-Fold Cross-Validation)

To ensure the model is robust, it was trained using 5-fold cross-validation on the combined 2017-2022 eRisk datasets. The average performance across the 5 validation folds demonstrates the model's stability.

  • Mean F1-Score across 5 Folds: 0.8623
  • Standard Deviation of F1-Score: 0.0093

The low standard deviation indicates that the model performs consistently across different subsets of the training data. The model uploaded here is the best-performing single model from Fold 1 of this process.
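The aggregates above can be reproduced from per-fold scores with the standard library. The fold values below are illustrative placeholders chosen to match the reported mean and standard deviation, since the card does not list individual folds:

```python
import statistics

# Hypothetical per-fold F1 scores (the card reports only mean and std)
fold_f1 = [0.8763, 0.8643, 0.8573, 0.8513, 0.8623]

mean_f1 = statistics.mean(fold_f1)
std_f1 = statistics.stdev(fold_f1)  # sample standard deviation

print(f"mean={mean_f1:.4f} std={std_f1:.4f}")  # mean=0.8623 std=0.0093
```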

How to Use

You can use this model with a text-classification pipeline.

from transformers import pipeline

# Load the model from the Hub
pipe = pipeline("text-classification", model="avtak/erisk-longformer-depression-v1")

# The model works best on longer texts that represent a collection of posts
user_posts = """
I've been feeling really down lately. Nothing seems fun anymore...
I tried playing my favorite game but I just couldn't get into it.
Sleep is my only escape but I wake up feeling just as tired.
"""

result = pipe(user_posts)
print(result)
# [{'label': 'LABEL_1', 'score': 0.85}] -> Example output where LABEL_1 is the "depressed" class
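Because the pipeline returns generic LABEL_0/LABEL_1 names, two small helpers are useful: one to concatenate a user's posting history into a single document, and one to rename the output classes. The LABEL_MAP below assumes LABEL_1 is the depressed class; confirm this against the model's config.id2label before relying on it:

```python
def build_user_document(posts: list[str], separator: str = "\n") -> str:
    """Join a user's posts into one document for user-level classification
    (the Longformer backbone accepts up to 4096 tokens)."""
    return separator.join(p.strip() for p in posts if p.strip())

# Hypothetical mapping; verify with the model's config.id2label
LABEL_MAP = {"LABEL_0": "non-depressed", "LABEL_1": "depressed"}

def readable(result: list[dict]) -> list[dict]:
    """Rename pipeline labels to human-readable class names."""
    return [{**r, "label": LABEL_MAP.get(r["label"], r["label"])} for r in result]
```

For example, `readable([{'label': 'LABEL_1', 'score': 0.85}])` yields `[{'label': 'depressed', 'score': 0.85}]`.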

Ethical Considerations and Limitations

  • Not a Diagnostic Tool: This model is NOT a medical diagnostic tool and should not be used as such. It only identifies statistical patterns in language that are correlated with a depression label in a specific dataset. Please consult a qualified healthcare professional for any mental health concerns.
  • High Risk of Misuse: Using this model to automatically label or judge individuals online is a misuse of the technology. It should only be used for research purposes under ethical guidelines.
  • Bias in Data: The training data is from Reddit, a platform with a specific demographic user base. The model may not generalize well to other platforms, cultures, or demographic groups. The linguistic expression of mental distress varies greatly.
  • Correlation, not Causation: The model identifies linguistic patterns correlated with depression, not the causes of depression.

Author and Contact

This model was developed by Hassan Hassanzadeh Aliabadi as part of a Master in Data Science degree at Universiti Malaya.

For questions about this model, please open a discussion on the Hugging Face community tab.

Citation

If you use this model in your research, please consider citing it:

@misc{hassanzadeh_aliabadi_erisk_2025,
  author = {Hassan Hassanzadeh Aliabadi},
  title = {Early Depression Detection and Correlational Analysis on eRisk by Longformer and Data Augmentation},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/avtak/erisk-longformer-depression-v1}}
}