# Evaluators Documentation

This directory contains all evaluator implementations for the Therapist Conversation Evaluator tool. Each evaluator measures a different aspect of therapeutic conversations.
## Overview
All evaluators follow a consistent interface and return standardized EvaluationResult objects. The result type depends on the granularity level:
- Utterance-level: Per-utterance scores (`granularity="utterance"`)
- Segment-level: Multi-utterance segment scores (`granularity="segment"`)
- Conversation-level: Overall conversation scores (`granularity="conversation"`)
## Evaluation Result Types

### Score Types
Categorical Score: Discrete labels (e.g., "High", "Medium", "Low")

```json
{
  "type": "categorical",
  "label": "High",
  "confidence": 0.85  # Optional: 0-1 confidence score
}
```

Numerical Score: Continuous values with a max bound (e.g., 1-5 scale)

```json
{
  "type": "numerical",
  "value": 4.0,
  "max_value": 5.0,
  "label": "High"  # Optional: derived label
}
```
### Granularity Levels
- Utterance-level: Scores for each individual utterance in the conversation
- Segment-level: Aggregate scores for multi-utterance segments
- Conversation-level: Overall scores for the entire conversation
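As a rough sketch of how these levels shape the result payload (the real `EvaluationResult` type lives in `custom_types` and may be typed differently; the `per_segment` key name is an assumption):

```python
from typing import Any, Dict, List, TypedDict

class UtteranceScores(TypedDict):
    index: int               # position of the utterance in the conversation
    metrics: Dict[str, Any]  # metric name -> categorical/numerical score dict

class EvaluationResultSketch(TypedDict, total=False):
    granularity: str                      # "utterance", "segment", or "conversation"
    per_utterance: List[UtteranceScores]  # utterance-level results
    per_segment: List[Dict[str, Any]]     # segment-level results (key name assumed)
    overall: Dict[str, Any]               # conversation-level / aggregate results
```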
## Evaluators

### 1. Empathy ER (Emotional Reaction) Evaluator
File: empathy_er_evaluator.py
Description: Measures the emotional reaction component of empathy in therapeutic responses. Evaluates how well the therapist responds to the patient's emotional state.
Model: RyanDDD/empathy-mental-health-reddit-ER
Result Type:
- Granularity: Utterance-level
- Score Type: Categorical (3 labels)
- Labels: `["Low", "Medium", "High"]`
- Evaluates: Therapist responses only
Output Format:
```json
{
  "granularity": "utterance",
  "per_utterance": [
    {
      "index": 0,
      "metrics": {
        "empathy_er": {
          "type": "categorical",
          "label": "High",
          "confidence": 0.92
        }
      }
    }
  ]
}
```
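Since this is an ordinary Hugging Face text-classification checkpoint, the per-utterance scoring can be approximated directly with `transformers`. A sketch only: the evaluator's actual preprocessing, context handling, and label mapping may differ, and the checkpoint's raw labels may need mapping to Low/Medium/High:

```python
from transformers import pipeline

# Load the ER empathy classifier from the Hub
classifier = pipeline("text-classification",
                      model="RyanDDD/empathy-mental-health-reddit-ER")

conversation = [
    {"role": "patient", "text": "I feel like nothing I do matters."},
    {"role": "therapist", "text": "That sounds really heavy. I'm glad you told me."},
]

# Score therapist turns only, mirroring "Evaluates: Therapist responses only"
for i, utt in enumerate(conversation):
    if utt["role"] != "therapist":
        continue
    pred = classifier(utt["text"])[0]  # e.g. {"label": "High", "score": 0.92}
    print(i, pred["label"], round(pred["score"], 2))
```

The IP, EX, and Talk Type evaluators below follow the same pattern with their respective checkpoints (Talk Type scores patient utterances instead of therapist ones).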
### 2. Empathy IP (Interpretation) Evaluator
File: empathy_ip_evaluator.py
Description: Measures the interpretation component of empathy in therapeutic responses. Evaluates how well the therapist interprets and understands the patient's situation.
Model: RyanDDD/empathy-mental-health-reddit-IP
Result Type:
- Granularity: Utterance-level
- Score Type: Categorical (3 labels)
- Labels: `["Low", "Medium", "High"]`
- Evaluates: Therapist responses only
Output Format:
```json
{
  "granularity": "utterance",
  "per_utterance": [
    {
      "index": 0,
      "metrics": {
        "empathy_ip": {
          "type": "categorical",
          "label": "Medium",
          "confidence": 0.78
        }
      }
    }
  ]
}
```
### 3. Empathy EX (Exploration) Evaluator
File: empathy_ex_evaluator.py
Description: Measures the exploration component of empathy in therapeutic responses. Evaluates how well the therapist explores and deepens understanding of the patient's concerns.
Model: RyanDDD/empathy-mental-health-reddit-EX
Result Type:
- Granularity: Utterance-level
- Score Type: Categorical (3 labels)
- Labels: `["Low", "Medium", "High"]`
- Evaluates: Therapist responses only
Output Format:
```json
{
  "granularity": "utterance",
  "per_utterance": [
    {
      "index": 0,
      "metrics": {
        "empathy_ex": {
          "type": "categorical",
          "label": "High",
          "confidence": 0.89
        }
      }
    }
  ]
}
```
### 4. Talk Type Evaluator
File: talk_type_evaluator.py
Description: Classifies patient utterances into change talk, sustain talk, or neutral, using a BERT model trained on motivational interviewing data.
Model: RyanDDD/bert-motivational-interviewing
Result Type:
- Granularity: Utterance-level
- Score Type: Categorical (3 labels)
- Labels: `["Change", "Neutral", "Sustain"]`
- Evaluates: Patient utterances only
Output Format:
```json
{
  "granularity": "utterance",
  "per_utterance": [
    {
      "index": 0,
      "metrics": {
        "talk_type": {
          "type": "categorical",
          "label": "Change",
          "confidence": 0.91
        }
      }
    }
  ]
}
```
### 5. Mental Health Factuality Evaluator
File: factuality_evaluator.py
Description: LLM-as-Judge scoring of assistant responses for clinical accuracy, safety, scope appropriateness, evidence-based practice, and overall quality. Uses a strict rubric to evaluate mental health chat responses.
Model: OpenAI GPT-4o (configurable)
Result Type:
- Granularity: Utterance-level
- Score Type: Numerical (5-point scale: 1-5)
- Scale: 1-5 for all dimensions
- Evaluates: Assistant/therapist responses only
Dimensions:
- `overall_score` (1-5): Overall factuality and quality
- `clinical_accuracy` (1-5): Clinical accuracy of information
- `safety` (1-5): Safety of the response
- `scope_appropriateness` (1-5): Appropriateness of scope
- `evidence_based` (1-5): Evidence-based practice alignment
- `explanation` (text): Reasoning for the scores
Output Format:
```json
{
  "granularity": "utterance",
  "per_utterance": [
    {
      "index": 0,
      "metrics": {
        "overall": {
          "type": "numerical",
          "value": 4,
          "max_value": 5
        },
        "clinical_accuracy": {
          "type": "numerical",
          "value": 4,
          "max_value": 5
        },
        "safety": {
          "type": "numerical",
          "value": 5,
          "max_value": 5
        },
        "scope": {
          "type": "numerical",
          "value": 4,
          "max_value": 5
        },
        "evidence": {
          "type": "numerical",
          "value": 3,
          "max_value": 5
        },
        "explanation": {
          "type": "text",
          "value": "Response demonstrates good clinical accuracy..."
        }
      }
    }
  ]
}
```
Scoring Rules:
- Conservative scoring: Prefers 3 (neutral) for generic responses that make no clinical claims
- Safety gate: If safety ≤ 2, the overall score is capped at the safety score
- Evidence requirement: Avoids a 5 unless the response references specific evidence-based techniques
- No clinical claims: If the response contains no clinical terms, clinical_accuracy and evidence_based are capped at 3
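Because the safety gate and the two caps are deterministic, they can be applied to the judge's raw scores as a post-processing step. A sketch under that assumption (the function and field names are illustrative, not the evaluator's actual code):

```python
def apply_scoring_rules(scores: dict, has_clinical_claims: bool) -> dict:
    """Apply the rubric's caps to raw 1-5 judge scores (illustrative only)."""
    adjusted = dict(scores)
    # Safety gate: if safety <= 2, overall_score is capped at the safety score
    if adjusted["safety"] <= 2:
        adjusted["overall_score"] = min(adjusted["overall_score"], adjusted["safety"])
    # No clinical claims: cap clinical_accuracy and evidence_based at 3
    if not has_clinical_claims:
        adjusted["clinical_accuracy"] = min(adjusted["clinical_accuracy"], 3)
        adjusted["evidence_based"] = min(adjusted["evidence_based"], 3)
    return adjusted

# An unsafe response has its overall score pulled down to the safety score
print(apply_scoring_rules(
    {"overall_score": 4, "clinical_accuracy": 4, "safety": 2,
     "scope_appropriateness": 4, "evidence_based": 3},
    has_clinical_claims=True,
))  # overall_score becomes 2
```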
### 6. Emotion Analysis Evaluator
File: emotion_evaluator.py
Description: Analyzes user emotions using an emotion classification model. Calculates a negative-emotion sum and a joy/neutral shift per utterance, and tracks the emotion trend across the conversation.
Model: j-hartmann/emotion-english-distilroberta-base
Result Type:
- Granularity: Utterance-level (with overall trend)
- Score Type: Numerical (with categorical labels)
- Evaluates: User/patient utterances only
Metrics:
- `emotion_sum_negative` (0.0-1.0): Sum of negative emotions (anger, disgust, fear, sadness)
  - Labels: "Low" (< 0.2), "Medium" (0.2-0.5), "High" (> 0.5)
- `emotion_joy_neutral_shift` (-1.0 to 1.0): Difference between joy and neutral emotion scores
  - Labels: "Positive" (> 0.2), "Neutral" (-0.2 to 0.2), "Negative" (< -0.2)
- `emotion_trend_direction` (overall): Trend analysis across the conversation
  - Labels: "improving", "declining", "stable", "neutral"
Emotion Labels: ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
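Given per-label probabilities from the emotion model, both utterance metrics reduce to simple arithmetic over those scores. A sketch using the thresholds above (the direction of the shift, joy minus neutral, is inferred from the Positive/Negative labels):

```python
NEGATIVE_EMOTIONS = {"anger", "disgust", "fear", "sadness"}

def utterance_emotion_metrics(scores: dict) -> dict:
    """scores maps each emotion label to its probability (summing to ~1.0)."""
    sum_negative = sum(scores[e] for e in NEGATIVE_EMOTIONS)
    shift = scores["joy"] - scores["neutral"]  # assumed direction: joy minus neutral
    neg_label = "Low" if sum_negative < 0.2 else ("Medium" if sum_negative <= 0.5 else "High")
    shift_label = "Positive" if shift > 0.2 else ("Negative" if shift < -0.2 else "Neutral")
    return {
        "emotion_sum_negative": {"type": "numerical", "value": sum_negative,
                                 "max_value": 1.0, "label": neg_label},
        "emotion_joy_neutral_shift": {"type": "numerical", "value": shift,
                                      "max_value": 1.0, "label": shift_label},
    }
```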
Output Format:
```json
{
  "granularity": "utterance",
  "per_utterance": [
    {
      "index": 0,
      "metrics": {
        "emotion_sum_negative": {
          "type": "numerical",
          "value": 0.35,
          "max_value": 1.0,
          "label": "Medium"
        },
        "emotion_joy_neutral_shift": {
          "type": "numerical",
          "value": -0.15,
          "max_value": 1.0,
          "label": "Neutral"
        }
      }
    }
  ],
  "overall": {
    "emotion_avg_sum_negative": {
      "type": "numerical",
      "value": 0.28,
      "max_value": 1.0,
      "label": "improving"
    },
    "emotion_avg_joy_neutral_shift": {
      "type": "numerical",
      "value": 0.12,
      "max_value": 1.0,
      "label": "improving"
    },
    "emotion_trend_direction": {
      "type": "categorical",
      "label": "improving"
    }
  }
}
```
Trend Analysis:
- Compares first half vs second half of conversation
- "improving": Negative emotions decrease AND joy/neutral shift increases
- "declining": Negative emotions increase AND joy/neutral shift decreases
- "stable": No significant change
- "neutral": Insufficient data
## Summary Table
| Evaluator | Granularity | Score Type | Labels/Scale | Evaluates | Model |
|---|---|---|---|---|---|
| Empathy ER | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | RyanDDD/empathy-mental-health-reddit-ER |
| Empathy IP | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | RyanDDD/empathy-mental-health-reddit-IP |
| Empathy EX | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | RyanDDD/empathy-mental-health-reddit-EX |
| Talk Type | Utterance | Categorical | 3 labels (Change/Neutral/Sustain) | Patient | RyanDDD/bert-motivational-interviewing |
| Factuality | Utterance | Numerical | 5-point scale (1-5) | Therapist | OpenAI GPT-4o |
| Emotion Analysis | Utterance + Overall | Numerical | 0.0-1.0 (with labels) | Patient | j-hartmann/emotion-english-distilroberta-base |
## Usage
All evaluators follow the same interface:
```python
from evaluators.impl.empathy_er_evaluator import EmpathyEREvaluator

# Initialize evaluator
evaluator = EmpathyEREvaluator()

# Evaluate conversation
result = evaluator.execute(conversation)

# Access results
for utterance_result in result["per_utterance"]:
    metrics = utterance_result["metrics"]
    if "empathy_er" in metrics:
        score = metrics["empathy_er"]
        print(f"Label: {score['label']}, Confidence: {score['confidence']}")
```
## Adding New Evaluators
To add a new evaluator:
1. Create a new file in the `impl/` directory
2. Inherit from the `Evaluator` base class
3. Register the class with the `@register_evaluator` decorator
4. Implement an `execute()` method that returns an `EvaluationResult`
5. Use the helper functions from `utils.evaluation_helpers`:
   - `create_categorical_score()` for categorical scores
   - `create_numerical_score()` for numerical scores
   - `create_utterance_result()` for utterance-level results
   - `create_conversation_result()` for conversation-level results
   - `create_segment_result()` for segment-level results
Example:
```python
from typing import List

from evaluators.base import Evaluator
from evaluators.registry import register_evaluator
from custom_types import Utterance, EvaluationResult
from utils.evaluation_helpers import create_categorical_score, create_utterance_result

@register_evaluator(
    "my_metric",
    label="My Metric",
    description="Description of what this metric measures",
    category="Category Name"
)
class MyEvaluator(Evaluator):
    METRIC_NAME = "my_metric"

    def execute(self, conversation: List[Utterance], **kwargs) -> EvaluationResult:
        # Implementation
        pass
```
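For a concrete (entirely hypothetical) body, an `execute()` that labels each utterance and packages the scores with the helpers could look like this; it assumes `Utterance` exposes a `text` attribute and that the helpers match the sketch under Score Types:

```python
# Hypothetical drop-in for the execute() stub above
def execute(self, conversation: List[Utterance], **kwargs) -> EvaluationResult:
    per_utterance = []
    for i, utterance in enumerate(conversation):
        # Placeholder rule standing in for a real model or rubric
        label = "High" if len(utterance.text) > 80 else "Low"  # assumes Utterance.text
        score = create_categorical_score(label, confidence=0.5)
        per_utterance.append({"index": i, "metrics": {self.METRIC_NAME: score}})
    return create_utterance_result(per_utterance)
```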