Model Card for Mistral-small-psych

Model Details

Model Description

  • Developed by: Clara Frydman-Gani, Alejandro Arias, Maria Perez Vallejo, John Daniel Londoño Martínez, Johanna Valencia-Echeverry, Mauricio Castaño, Alex A T Bui, Nelson B Freimer, Carlos Lopez-Jaramillo, Loes M Olde Loohuis
  • Funded by: See accompanying manuscript.
  • Shared by: LoesLab / Clara Frydman-Gani
  • Model type: Decoder-only generative large language model; instruction-tuned for Spanish clinical psychiatric phenotype extraction
  • Language(s) (NLP): Spanish
  • License: Apache 2.0
  • Finetuned from model: Mistral-Small-24B-Instruct-2501
  • Task: Psychiatric symptom-level phenotype detection in Spanish clinical text
  • Framework: PEFT 0.16.0; LoRA fine-tuning; Unsloth; 4-bit quantized base model

Model Sources

  • Repository: https://github.com/clarafrydman/LLMs_for_psychiatric_phenotyping/tree/master/
  • Paper: Frydman-Gani C, Arias A, Vallejo MP, Londoño Martínez JD, Valencia-Echeverry J, Castaño M, Bui AAT, Freimer NB, Lopez-Jaramillo C, Olde Loohuis LM. Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records. medRxiv [Preprint]. 2025 Aug 12:2025.08.07.25333172. doi: 10.1101/2025.08.07.25333172. PMID: 40832382; PMCID: PMC12363723.
  • Demo: See accompanying repository.

Uses

Direct Use

This model may be used to extract psychiatric phenotype mentions from Spanish-language clinical text. Given a Spanish clinical note, the model can generate a list of predicted symptom spans and corresponding phenotype labels.

Example use cases include:

  • Extracting symptom-level psychiatric concepts from Spanish clinical notes;
  • Supporting research pipelines for document-level psychiatric concept extraction;
  • Identifying candidate psychiatric phenotype mentions for downstream review, aggregation, or patient-level phenotyping workflows;
  • Assisting annotation or quality-control workflows, with human review.

This model does not itself perform full patient-level clinical phenotyping or evaluation, diagnosis assignment, risk prediction, or clinical adjudication/decision-making. These tasks require additional validation, reasoning, and clinical review.

Downstream Use

The model outputs may be used as inputs to downstream research workflows, such as:

  • Further fine-tuning on local clinical text for symptom detection;
  • Symptom burden estimation;
  • Cohort characterization;
  • Clinical NLP benchmarking;
  • Psychiatric research using Spanish-language electronic health record text.


Out-of-Scope Use

This model should not be used:

  • as a clinical decision-making tool;
  • to diagnose patients;
  • to determine treatment, hospitalization, disability, or risk status;
  • without local expert validation in the intended clinical setting;
  • on non-Spanish text without additional validation;
  • as a replacement for clinician review;
  • to process protected or identifiable clinical data outside secure, compliant environments;
  • to draw patient-level conclusions from a single document without longitudinal context.

Bias, Risks, and Limitations

This model has several important limitations:

  • Clinical and geographic scope: The model was developed and evaluated using Spanish clinical text from psychiatric hospitals in Colombia. Performance may differ in other Spanish-speaking countries, institutions, specialties, or documentation systems.
  • Dialect and terminology variation: Clinical Spanish varies across regions and institutions, including abbreviations, symptom phrasing, and documentation conventions.
  • Document-level extraction: The model extracts phenotype mentions from individual clinical documents. It does not perform full patient-level phenotyping, which may require longitudinal aggregation and clinical judgment.
  • Out-of-ontology outputs: The model uses generative free-text decoding and may occasionally generate labels outside the intended psychiatric phenotype ontology. Outputs should be normalized, filtered, or validated before use.
  • Synthetic-data training: The model was fine-tuned on synthetic clinical-style text rather than raw real EHR notes. Although this supports public sharing and privacy preservation, synthetic text may not capture all real-world documentation patterns.
  • Weak-labeling bias: The synthetic dataset was augmented using a traditional NLP system to identify incidental phenotype mentions. The model may partially inherit biases or blind spots from that weak-labeling process.
  • Quantization: The model was fine-tuned using 4-bit quantization. Quantization may affect performance, particularly for non-English clinical text and domain-specific terminology.
  • Clinical safety: Incorrect extractions, missed symptoms, or out-of-ontology predictions could mislead downstream analyses if not reviewed or post-processed.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Users should:

  • validate the model on local data before deployment;
  • map outputs to a predefined phenotype ontology;
  • use structured output validation or ontology-constrained post-processing;
  • review predictions manually;
  • avoid using outputs as diagnoses or treatment recommendations;
  • evaluate performance separately for each phenotype (rare or lexically variable phenotypes may have lower recall);
  • document any local adaptation, fine-tuning, or post-processing applied.

How to Get Started with the Model

Example inference code using Unsloth:


from unsloth import FastLanguageModel
from transformers import GenerationConfig

model_id = "loeslab/mistral-small-psych"  # replace if different

max_seq_length = 5000
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_id,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

FastLanguageModel.for_inference(model)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

instruction = (
    "I will provide a clinical note in Spanish. Identify a list of all symptoms from the input text, "
    "even if the symptom appears in a negated, hypothetical, or historical context, or described a different person. "
    "Format your answer as a list of the phenotypes separated by new line character and do not generate random answers. "
    "Only output the list."
)

clinical_note = """
Paciente refiere ánimo deprimido, insomnio y ansiedad durante las últimas semanas.
Niega ideación suicida actual.
"""

inputs = tokenizer(
    [
        alpaca_prompt.format(
            instruction,
            clinical_note,
            "",
        )
    ],
    return_tensors="pt",
).to("cuda")

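# Greedy decoding for reproducible output; temperature only applies when
# sampling is enabled, so do_sample=False is the explicit way to request it.
generation_config = GenerationConfig(
    do_sample=False,
)

outputs = model.generate(
    **inputs,
    max_new_tokens=1000,  # adjust according to your needs
    generation_config=generation_config,
)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.batch_decode(
    outputs[:, prompt_length:], skip_special_tokens=True
)[0]
print(response)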

The expected output is one phenotype mention per line. Post-process the output and map it to the study phenotype ontology before evaluation or downstream use.
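A minimal post-processing sketch follows. It assumes the model follows the "span | label" line format and the <END> marker shown in the training examples under Preprocessing, and it works whether or not the decoded string still contains the prompt. The function name parse_response is hypothetical and is not taken from the accompanying repository.

```python
from typing import List, Tuple

RESPONSE_MARKER = "### Response:"
END_MARKER = "<END>"  # end-of-answer marker seen in the training examples


def parse_response(decoded: str) -> List[Tuple[str, str]]:
    """Return (span, label) pairs parsed from a decoded generation."""
    # Keep only the text after the last response header, if one is present.
    answer = decoded.rsplit(RESPONSE_MARKER, 1)[-1]
    # Drop anything generated after the end marker.
    answer = answer.split(END_MARKER, 1)[0]
    pairs = []
    for line in answer.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last "|" so a span that itself contains "|" stays intact.
        span, sep, label = line.rpartition("|")
        if not sep:
            # Assumption: a line without a separator is a bare label.
            span, label = "", line
        pairs.append((span.strip(), label.strip()))
    return pairs
```

Lines that do not match the expected format are kept rather than dropped, so that they can be inspected during manual review.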

Training Details

Training Data

Mistral-small-psych was fine-tuned on a synthetic Spanish clinical-text dataset designed to avoid exposure of real patient notes. Synthetic sentences were generated using Llama3-70B from previously published clinician-reviewed phenotype annotation spans. These spans were manually reviewed to ensure that they did not contain protected health information. Synthetic sentences were then assembled into synthetic documents.

The final synthetic training dataset contained:

  • 1,223 synthetic documents
  • Spanish clinical-style text
  • psychiatric phenotype annotations based on:
      • originally targeted phenotype spans
      • additional incidental phenotype mentions identified using a traditional NLP weak-labeling approach

The model was not trained on raw EHR notes or identifiable patient information.

Training Procedure

The model was instruction-tuned to extract detected phenotypes from clinical text and return associated labeled spans. Training used LoRA parameter-efficient fine-tuning on a 4-bit quantized base model.

Preprocessing

Synthetic sentences were quality-controlled to ensure inclusion of intended annotation spans. Sentences were then grouped into synthetic documents and converted to Alpaca-style instruction-tuning format.

Training examples followed this general structure:

### Instruction:
I will provide a clinical note in Spanish. Identify a list of all symptoms from the input text...

### Input:
<DOCUMENT_BODY>

### Response:
<PHENOTYPE_1_TEXT>| <PHENOTYPE_1_LABEL>
<PHENOTYPE_2_TEXT>| <PHENOTYPE_2_LABEL>
...

# For example:
# "no quiero hacer nada | Abulia
# toma mucho | Abusodesustancias
# toma mucho | Alcohol
# Cambio de genio | Labilidademocional
# <END>"
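The structure above can be sketched as a small formatting helper. This is an illustration only: build_example and ALPACA_TEMPLATE are hypothetical names, the " | " separator follows the worked example rather than a documented specification, and the Alpaca preamble used in the inference prompt is omitted for brevity.

```python
from typing import List, Tuple

# Section layout mirrors the training example structure shown above.
ALPACA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{document}\n\n"
    "### Response:\n{response}"
)


def build_example(
    instruction: str,
    document: str,
    annotations: List[Tuple[str, str]],
) -> str:
    """Format one synthetic document and its (span, label) pairs."""
    lines = [f"{span} | {label}" for span, label in annotations]
    lines.append("<END>")  # end-of-answer marker from the example above
    return ALPACA_TEMPLATE.format(
        instruction=instruction,
        document=document.strip(),
        response="\n".join(lines),
    )


example = build_example(
    "I will provide a clinical note in Spanish. ...",
    "Dice: no quiero hacer nada.",
    [("no quiero hacer nada", "Abulia")],
)
print(example)
```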

Training Hyperparameters

  • Base model: Mistral-Small-24B-Instruct-2501
  • Fine-tuning method: LoRA
  • Quantization: 4-bit
  • LoRA rank: r=8
  • LoRA alpha: 16
  • LoRA dropout: 0.05
  • Target modules: all linear layers
  • Training regime: BF16 (bfloat16) precision
  • Learning rate: 3e-4
  • Gradient accumulation steps: 16
  • Training framework: Unsloth
  • Epochs: 9
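As a rough worked example of what the reported rank and alpha imply, the sketch below computes the LoRA scaling factor and the number of trainable parameters added to a single linear layer. It assumes standard (non-rank-stabilized) LoRA, where the update is scaled by alpha / r; the 4096 x 4096 layer size is a hypothetical value chosen for illustration and is not the base model's actual dimension.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_params_per_linear(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    The adapter consists of A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 8, 16  # values reported in the hyperparameter list

print(lora_scaling(ALPHA, RANK))  # 2.0

# Hypothetical layer size, for illustration only.
added = lora_params_per_linear(4096, 4096, RANK)
full = 4096 * 4096
print(added, f"{added / full:.2%} of the frozen weight matrix")
```

With these illustrative dimensions the adapter adds 65,536 trainable parameters per layer, which is why fine-tuning fits on a single 24 GB GPU when combined with a 4-bit base model.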

Speeds, Sizes, Times

  • Training hardware: AWS g5.2xlarge instance
  • GPU: 1 NVIDIA A10 Tensor Core GPU
  • GPU VRAM: 24 GB
  • Training time: ~4 hours for 9 epochs

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on clinician-annotated Spanish EHR documents from:

  • Clínica San Juan de Dios Manizales (CSJDM), Colombia [primary test set]: 358 EHR documents
  • Hospital Mental de Antioquia (HOMO), Colombia [external cross-site test set]: 309 EHR documents

Both datasets were annotated for psychiatric phenotypes by expert clinicians.

Factors

[More Information Needed]

Metrics

Performance was evaluated using precision, recall, F1, and cross-site performance drop.

Reported aggregate metrics (macro-F1, micro-F1) were generally calculated across phenotypes with sufficient support in the relevant training and test sets.
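The difference between macro-F1 and micro-F1 can be sketched as follows. This is a simplified illustration that scores each phenotype label as present or absent per document; the study's exact protocol (for example, its support thresholds or matching rules) may differ, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, object]:
    """Document-level per-label F1, macro-F1, and micro-F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in g & p:
            tp[label] += 1
        for label in p - g:
            fp[label] += 1
        for label in g - p:
            fn[label] += 1
    labels = sorted(set(tp) | set(fp) | set(fn))
    per_label = {l: f1(tp[l], fp[l], fn[l]) for l in labels}
    # Macro-F1 averages per-label scores; micro-F1 pools the counts first.
    macro = sum(per_label.values()) / len(per_label) if per_label else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"per_label_f1": per_label, "macro_f1": macro, "micro_f1": micro}
```

Because macro-F1 weights every phenotype equally, rare phenotypes with few test examples can move it substantially, which is one reason to restrict aggregates to phenotypes with sufficient support.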

Results

On the CSJDM test set, Mistral-small-psych achieved:

  • Precision: 0.82 (SD 0.17)
  • Recall: 0.80 (SD 0.18)
  • Macro-F1: 0.79 (SD 0.15)
  • Micro-F1: 0.85
  • Balanced accuracy: 0.89 (SD 0.09)

These results were comparable to several models fine-tuned on real EHR data, while allowing public release because the model was trained on synthetic clinical-style text rather than raw EHR notes.

Summary

Model Examination

Qualitative error analysis showed that the model could occasionally generate phenotype labels outside the fine-tuning ontology, including medical conditions, medications, or medication categories. These outputs were excluded from ontology-based evaluation metrics and highlight the need for structured output validation and ontology-constrained post-processing in downstream applications.
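One way to implement ontology-constrained post-processing is to compare each predicted label against an allowed set and set aside anything that does not match. The sketch below is an assumption-laden illustration: the allowed set shown is a placeholder rather than the study ontology, "Sertralina" is an invented example of an out-of-ontology output, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (span, label)


def normalize(label: str) -> str:
    """Case- and whitespace-insensitive key for comparing labels."""
    return "".join(label.split()).casefold()


def split_by_ontology(
    pairs: Iterable[Pair], allowed: Iterable[str]
) -> Tuple[List[Pair], List[Pair]]:
    """Separate predictions into in-ontology and out-of-ontology lists."""
    lookup = {normalize(a): a for a in allowed}
    kept, rejected = [], []
    for span, label in pairs:
        canonical = lookup.get(normalize(label))
        if canonical is None:
            rejected.append((span, label))  # keep for manual review
        else:
            kept.append((span, canonical))  # rewrite to canonical spelling
    return kept, rejected


allowed = {"Abulia", "Labilidademocional"}  # placeholder, not the real ontology
predictions = [
    ("no quiero hacer nada", "abulia"),
    ("Cambio de genio", "Labilidad emocional"),
    ("toma sertralina", "Sertralina"),
]
kept, rejected = split_by_ontology(predictions, allowed)
print(kept)
print(rejected)
```

Returning the rejected pairs instead of silently discarding them makes it possible to audit how often the model leaves the ontology on local data.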

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: AWS g5.2xlarge; 1 NVIDIA A10 Tensor Core GPU, 24 GB VRAM
  • Hours used: 4
  • Cloud Provider: AWS
  • Compute Region: U.S. West/Oregon
  • Carbon Emitted: NA

Technical Specifications

Model Architecture and Objective

Mistral-small-psych is a decoder-only Transformer-based generative LLM fine-tuned using instruction tuning for Spanish psychiatric phenotype mention extraction. The model takes Spanish clinical text as input and generates phenotype spans and labels.

Compute Infrastructure


Hardware

  • Training: AWS g5.2xlarge, 1 NVIDIA A10 Tensor Core GPU, 24 GB VRAM
  • Benchmarking/inference in the study: AWS G5 instances using NVIDIA A10G Tensor Core GPUs

Software

  • Python
  • PyTorch
  • Transformers
  • PEFT 0.16.0
  • Unsloth
  • Hugging Face Hub

Citation

BibTeX:

@article{frydman_gani_2025_llm_psychiatric_phenotype_extraction,
  title   = {Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records},
  author  = {Frydman-Gani, Clara and Arias, Alejandro and Perez Vallejo, Maria and Londo{\~n}o Mart{\'i}nez, John Daniel and Valencia-Echeverry, Johanna and Casta{\~n}o, Mauricio and Bui, Alex A. T. and Freimer, Nelson B. and Lopez-Jaramillo, Carlos and Olde Loohuis, Loes M.},
  journal = {medRxiv},
  year    = {2025},
  doi     = {10.1101/2025.08.07.25333172},
  note    = {Preprint}
}

APA:

Frydman-Gani, C., Arias, A., Perez Vallejo, M., Londoño Martínez, J. D., Valencia-Echeverry, J., Castaño, M., Bui, A. A. T., Freimer, N. B., Lopez-Jaramillo, C., & Olde Loohuis, L. M. (2025). Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records. medRxiv. https://doi.org/10.1101/2025.08.07.25333172

Glossary

  • EHR: Electronic health record
  • LLM: Large language model
  • LoRA: Low-rank adaptation
  • NLP: Natural language processing
  • PEFT: Parameter-efficient fine-tuning
  • tNLP: Traditional natural language processing
  • Phenotype mention extraction: Detection of explicitly mentioned symptoms, signs, behaviors, or mental status findings in clinical text
  • Ontology-constrained decoding: Restricting model outputs to a predefined set of allowed labels/symptoms

More Information

[More Information Needed]

Model Card Authors

Clara Frydman-Gani

Model Card Contact

Clara Frydman-Gani: clarafrydman@gmail.com

Framework versions

  • PEFT 0.16.0