
MindGuard-8B: Clinically Grounded Safety Classifier for Mental Health AI

[Figure: Safety Pipeline overview]

MindGuard-8B is a lightweight safety classifier specifically designed for mental health conversations. Developed by Sword Health in collaboration with PhD-level licensed clinical psychologists, this model introduces a clinically grounded risk taxonomy that distinguishes actionable harm from non-crisis therapeutic content.

Model Overview

MindGuard-8B achieves 0.982 AUROC while providing a 2-26× reduction in false positives compared to general-purpose safeguards at high-recall operating points. Despite having only 8 billion parameters, it outperforms baseline models up to 15× larger, making it suitable for real-time deployment in mental health applications.

Key Performance Metrics

  • Model Size: 8 billion parameters
  • Inference Speed: Optimized for real-time classification
  • Clinical Validation: Developed with licensed psychologists

Model Details

Model Description

MindGuard-8B is a transformer-based safety classifier fine-tuned specifically for mental health conversation safety assessment. Unlike general-purpose content moderation systems, it's designed to understand the nuanced difference between therapeutic disclosure and genuine clinical crisis.

  • Developed by: Sword Health
  • Language: English
  • License: CC-BY-NC-SA-4.0
  • Specialized for: Mental health conversation safety assessment

Clinical Risk Taxonomy

The model classifies content into three clinically grounded categories:

✓ Safe

  • Common therapeutic topics
  • Depression/anxiety symptoms without suicidal ideation
  • Metaphorical expressions of distress
  • Historical or third-person references to trauma

⚠ Self-Harm Risk

  • Explicit or implicit suicidal ideation
  • Self-directed harm indicators
  • Content requiring clinical risk assessment

⚠ Harm to Others Risk

  • Threats toward identifiable individuals
  • Violent ideation
  • Abuse/neglect of protected populations
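As a minimal sketch, the taxonomy can be represented as a lookup from the model's category codes to human-readable labels. The code values here (S1/S2) are an assumption inferred from the usage example later in this card, where a harm-to-others conversation yields "Categories: S2"; verify the actual codes against the paper before relying on them.

```python
# Hypothetical mapping from MindGuard category codes to the clinical risk
# taxonomy. The S1/S2 assignments are ASSUMPTIONS, not confirmed by the card.
RISK_TAXONOMY = {
    "S1": "Self-Harm Risk",
    "S2": "Harm to Others Risk",
}

def describe(category_code: str) -> str:
    """Return a human-readable label for a MindGuard category code."""
    return RISK_TAXONOMY.get(category_code, "Safe / unrecognized code")
```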

Usage Examples

Basic Classification

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the tokenizer (used only to apply the chat template)
model_name = "swordhealth/MindGuard-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example conversation context.
# Note: the model is trained for multi-turn input; performance degrades with
# short contexts or single prompts.
conversation = [
    {"role": "assistant", "content": "Hey, I'm glad you reached out. How are you doing today?"},
    {"role": "user", "content": "Stressed. Taking care of my mom with dementia is hell. She screams, hits me, calls me thief — drives me insane."},
    {"role": "assistant", "content": "I'm so sorry you're carrying this weight. It's okay to feel overwhelmed — your strength matters, even when it feels like you're breaking."},
    {"role": "user", "content": "I'm not breaking. She pushes me, so I push her back. If she didn't act like a lunatic, I wouldn't have to huddle her the way I do."}
]

sampling_params = SamplingParams(best_of=1, temperature=0, max_tokens=15)

mindguard = LLM(model=model_name, tensor_parallel_size=1, max_model_len=4096)

# Render the conversation with the chat template and classify it
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
outputs = mindguard.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
> Safety: Unsafe
> Categories: S2
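Downstream code usually needs the classifier's text output in structured form. The following is a minimal parsing sketch based solely on the two-line format shown above; it assumes (unverified) that safe content may omit the "Categories" line.

```python
def parse_mindguard_output(text: str) -> dict:
    """Parse MindGuard's two-line output into a structured result.

    Expected format (per the example above):
        Safety: Unsafe
        Categories: S2
    Assumes the 'Categories' line may be absent for safe content.
    """
    result = {"safe": True, "categories": []}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "safety":
            result["safe"] = value.lower() == "safe"
        elif key == "categories":
            result["categories"] = [c.strip() for c in value.split(",") if c.strip()]
    return result

print(parse_mindguard_output("Safety: Unsafe\nCategories: S2"))
# {'safe': False, 'categories': ['S2']}
```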

Performance Comparison

On the MindGuard-testset benchmark:

Model               AUROC   FPR@90%TPR   Parameters
MindGuard           0.982   0.031        8B
gpt-oss-safeguard   0.960   0.084        120B
Llama Guard 3       0.970   0.066        8B
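The false-positive reduction relative to each baseline can be read directly off the FPR@90%TPR column. The two baselines in this table give roughly a 2-3× reduction; the 2-26× range quoted in the overview presumably includes additional safeguards not listed here.

```python
# False-positive rates at 90% TPR, taken from the table above.
fpr = {"MindGuard": 0.031, "gpt-oss-safeguard": 0.084, "Llama Guard 3": 0.066}

for name, rate in fpr.items():
    if name != "MindGuard":
        print(f"{name}: {rate / fpr['MindGuard']:.1f}x more false positives than MindGuard")
```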

Limitations and Bias

Known Limitations

  • Multi-turn: The model was trained on multi-turn conversations, not single prompts. It is not meant to be used with single messages or short contexts (1-2 turns).
  • English-only: Currently trained and validated only on English conversations
  • Cultural considerations: Training data may not fully represent all cultural expressions of distress
  • Real-time constraints: Performance may vary with very long conversation contexts
  • Potential for Errors: Like all AI systems, this model may produce false positives, false negatives, or other classification errors

Training Details

Training Data

The model was trained using:

  • Synthetic mental health conversations generated with clinical supervision
  • Expert annotations from licensed clinical psychologists
  • Diverse risk scenarios and therapeutic contexts
  • Balanced representation of safe and unsafe content

Evaluation Data

Evaluated on MindGuard-testset: 1,134 expert-annotated turns from 67 multi-turn conversations, annotated by licensed clinical psychologists with 94.4% agreement.

License

This model is released under the CC-BY-NC-SA-4.0 license.

Important Disclaimer

⚠️ RESEARCH USE ONLY - NO COMMERCIAL APPLICATION PERMITTED ⚠️

This model is provided under the CC-BY-NC-SA-4.0 license for research purposes only. By using this model, you acknowledge and agree to the following terms:

License Restrictions

  • No Commercial Use: This model is explicitly prohibited from use in any commercial application, product, or service
  • Non-Commercial Research Only: Permitted uses are limited to academic research, educational purposes, and non-commercial mental health research
  • Attribution Required: Any use must provide appropriate attribution as specified in the CC-BY-NC-SA-4.0 license

Clinical Limitations and Liability

  • Research Tool Only: This classifier is intended solely for research purposes in mental health AI safety
  • Human Oversight Required: Any application must maintain appropriate human clinical oversight and intervention protocols
  • Potential for Errors: Like all AI systems, this model may produce false positives, false negatives, or other classification errors

Disclaimer of Responsibility

Sword Health disclaims all responsibility and liability for:

  • Any clinical decisions made based on model outputs
  • Any harm resulting from model misclassification or errors
  • Inappropriate use of the model in commercial or clinical settings
  • Failure to maintain appropriate human oversight and intervention protocols

By using this model, you assume full responsibility for ensuring appropriate and ethical use within the bounds of the specified license and these terms.

Citation

@misc{mindguardguard,
      title={MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support}, 
      author={António Farinhas and Nuno M. Guerreiro and José Pombal and Pedro Henrique Martins and Laura Melton and Alex Conway and Cara Dochat and Maya D'Eon and Ricardo Rei},
      year={2026},
      eprint={2602.00950},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.00950}, 
}