# Evaluators Documentation

This directory contains all evaluator implementations for the Therapist Conversation Evaluator tool. Each evaluator measures different aspects of therapeutic conversations.

## Overview

All evaluators follow a consistent interface and return standardized `EvaluationResult` objects. The result type depends on the granularity level:

- **Utterance-level**: Per-utterance scores (`granularity="utterance"`)
- **Segment-level**: Multi-utterance segment scores (`granularity="segment"`)
- **Conversation-level**: Overall conversation scores (`granularity="conversation"`)

## Evaluation Result Types

### Score Types

1. **Categorical Score**: Discrete labels (e.g., "High", "Medium", "Low")

```python
{
    "type": "categorical",
    "label": "High",
    "confidence": 0.85  # Optional: 0-1 confidence score
}
```

2. **Numerical Score**: Continuous values with a maximum bound (e.g., a 1-5 scale)

```python
{
    "type": "numerical",
    "value": 4.0,
    "max_value": 5.0,
    "label": "High"  # Optional: derived label
}
```
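These two score shapes can be produced by small constructor helpers. A minimal sketch — the real helpers live in `utils.evaluation_helpers` (see "Adding New Evaluators" below), so the signatures here are assumptions:

```python
def create_categorical_score(label, confidence=None):
    """Build a categorical score dict; confidence (0-1) is optional."""
    score = {"type": "categorical", "label": label}
    if confidence is not None:
        score["confidence"] = confidence
    return score


def create_numerical_score(value, max_value, label=None):
    """Build a numerical score dict; the derived label is optional."""
    score = {"type": "numerical", "value": value, "max_value": max_value}
    if label is not None:
        score["label"] = label
    return score
```

Keeping optional fields absent (rather than set to `None`) matches the example dicts above, where `confidence` and `label` only appear when present.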
### Granularity Levels

- **Utterance-level**: Scores for each individual utterance in the conversation
- **Segment-level**: Aggregate scores for multi-utterance segments
- **Conversation-level**: Overall scores for the entire conversation

---

## Evaluators

### 1. Empathy ER (Emotional Reaction) Evaluator

**File**: `empathy_er_evaluator.py`

**Description**: Measures the emotional reaction component of empathy in therapeutic responses. Evaluates how well the therapist responds to the patient's emotional state.

**Model**: `RyanDDD/empathy-mental-health-reddit-ER`

**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Low", "Medium", "High"]`
- **Evaluates**: Therapist responses only

**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_er": {
                    "type": "categorical",
                    "label": "High",
                    "confidence": 0.92
                }
            }
        }
    ]
}
```
---

### 2. Empathy IP (Interpretation) Evaluator

**File**: `empathy_ip_evaluator.py`

**Description**: Measures the interpretation component of empathy in therapeutic responses. Evaluates how well the therapist interprets and understands the patient's situation.

**Model**: `RyanDDD/empathy-mental-health-reddit-IP`

**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Low", "Medium", "High"]`
- **Evaluates**: Therapist responses only

**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_ip": {
                    "type": "categorical",
                    "label": "Medium",
                    "confidence": 0.78
                }
            }
        }
    ]
}
```

---

### 3. Empathy EX (Exploration) Evaluator

**File**: `empathy_ex_evaluator.py`

**Description**: Measures the exploration component of empathy in therapeutic responses. Evaluates how well the therapist explores and deepens understanding of the patient's concerns.

**Model**: `RyanDDD/empathy-mental-health-reddit-EX`

**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Low", "Medium", "High"]`
- **Evaluates**: Therapist responses only

**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_ex": {
                    "type": "categorical",
                    "label": "High",
                    "confidence": 0.89
                }
            }
        }
    ]
}
```
---

### 4. Talk Type Evaluator

**File**: `talk_type_evaluator.py`

**Description**: Classifies patient utterances as change talk, sustain talk, or neutral. Uses a BERT model trained on motivational interviewing data.

**Model**: `RyanDDD/bert-motivational-interviewing`

**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Change", "Neutral", "Sustain"]`
- **Evaluates**: Patient utterances only

**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "talk_type": {
                    "type": "categorical",
                    "label": "Change",
                    "confidence": 0.91
                }
            }
        }
    ]
}
```
---

### 5. Mental Health Factuality Evaluator

**File**: `factuality_evaluator.py`

**Description**: LLM-as-Judge scoring of assistant responses for clinical accuracy, safety, scope appropriateness, evidence-based practice, and overall quality. Uses a strict rubric to evaluate mental health chat responses.

**Model**: OpenAI GPT-4o (configurable)

**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Numerical (5-point scale: 1-5)
- **Scale**: 1-5 for all dimensions
- **Evaluates**: Assistant/therapist responses only

**Dimensions**:
1. `overall_score` (1-5): Overall factuality and quality (output key: `overall`)
2. `clinical_accuracy` (1-5): Clinical accuracy of information
3. `safety` (1-5): Safety of the response
4. `scope_appropriateness` (1-5): Appropriateness of scope (output key: `scope`)
5. `evidence_based` (1-5): Evidence-based practice alignment (output key: `evidence`)
6. `explanation` (text): Reasoning for the scores

**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "overall": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "clinical_accuracy": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "safety": {
                    "type": "numerical",
                    "value": 5,
                    "max_value": 5
                },
                "scope": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "evidence": {
                    "type": "numerical",
                    "value": 3,
                    "max_value": 5
                },
                "explanation": {
                    "type": "text",
                    "value": "Response demonstrates good clinical accuracy..."
                }
            }
        }
    ]
}
```

**Scoring Rules**:
- Conservative scoring: Prefers 3 (neutral) for generic responses without clinical claims
- Safety gate: If `safety` ≤ 2, `overall_score` is capped at the safety level
- Evidence requirement: Avoids a 5 unless the response references specific evidence-based techniques
- No clinical claims: If the response lacks clinical terms, `clinical_accuracy` and `evidence_based` are capped at 3
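The two mechanical caps (the safety gate and the no-clinical-claims cap) can be sketched as a post-processing step on the raw rubric scores; the other two rules are enforced in the judge prompt itself. This helper is illustrative only — the actual logic in `factuality_evaluator.py` may differ:

```python
def apply_scoring_rules(scores, has_clinical_claims):
    """Apply the mechanical scoring caps to raw rubric scores (a sketch).

    scores: dict with keys such as overall_score, clinical_accuracy,
    safety, evidence_based, each on a 1-5 scale.
    """
    adjusted = dict(scores)
    # No clinical claims: cap clinical_accuracy and evidence_based at 3.
    if not has_clinical_claims:
        adjusted["clinical_accuracy"] = min(adjusted["clinical_accuracy"], 3)
        adjusted["evidence_based"] = min(adjusted["evidence_based"], 3)
    # Safety gate: if safety <= 2, the overall score cannot exceed it.
    if adjusted["safety"] <= 2:
        adjusted["overall_score"] = min(adjusted["overall_score"], adjusted["safety"])
    return adjusted
```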
---

### 6. Emotion Analysis Evaluator

**File**: `emotion_evaluator.py`

**Description**: Analyzes user emotions using an emotion classification model. Calculates the negative emotion sum and the joy/neutral shift, and tracks emotion change trends across the conversation.

**Model**: `j-hartmann/emotion-english-distilroberta-base`

**Result Type**:
- **Granularity**: Utterance-level (with overall trend)
- **Score Type**: Numerical (with categorical labels)
- **Evaluates**: User/patient utterances only

**Metrics**:
1. `emotion_sum_negative` (0.0-1.0): Sum of negative emotions (anger, disgust, fear, sadness)
   - Labels: "Low" (< 0.2), "Medium" (0.2-0.5), "High" (> 0.5)
2. `emotion_joy_neutral_shift` (-1.0 to 1.0): Difference between joy and neutral emotion scores
   - Labels: "Positive" (> 0.2), "Neutral" (-0.2 to 0.2), "Negative" (< -0.2)
3. `emotion_trend_direction` (overall): Trend analysis across the conversation
   - Labels: "improving", "declining", "stable", "neutral"

**Emotion Labels**: `["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]`
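Both per-utterance metrics can be derived directly from the classifier's per-emotion probabilities using the thresholds listed above. A minimal sketch (the function name and input format are assumptions, not the evaluator's actual API):

```python
NEGATIVE_EMOTIONS = {"anger", "disgust", "fear", "sadness"}


def derive_emotion_metrics(probs):
    """Compute the two per-utterance metrics from a dict mapping each
    emotion label to its classifier probability."""
    neg_sum = sum(p for emotion, p in probs.items() if emotion in NEGATIVE_EMOTIONS)
    shift = probs.get("joy", 0.0) - probs.get("neutral", 0.0)
    # Thresholds as documented in the Metrics list above.
    neg_label = "Low" if neg_sum < 0.2 else "High" if neg_sum > 0.5 else "Medium"
    shift_label = "Positive" if shift > 0.2 else "Negative" if shift < -0.2 else "Neutral"
    return {
        "emotion_sum_negative": {
            "type": "numerical", "value": neg_sum, "max_value": 1.0, "label": neg_label,
        },
        "emotion_joy_neutral_shift": {
            "type": "numerical", "value": shift, "max_value": 1.0, "label": shift_label,
        },
    }
```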
**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "emotion_sum_negative": {
                    "type": "numerical",
                    "value": 0.35,
                    "max_value": 1.0,
                    "label": "Medium"
                },
                "emotion_joy_neutral_shift": {
                    "type": "numerical",
                    "value": -0.15,
                    "max_value": 1.0,
                    "label": "Neutral"
                }
            }
        }
    ],
    "overall": {
        "emotion_avg_sum_negative": {
            "type": "numerical",
            "value": 0.28,
            "max_value": 1.0,
            "label": "Medium"
        },
        "emotion_avg_joy_neutral_shift": {
            "type": "numerical",
            "value": 0.12,
            "max_value": 1.0,
            "label": "Neutral"
        },
        "emotion_trend_direction": {
            "type": "categorical",
            "label": "improving"
        }
    }
}
```

**Trend Analysis**:
- Compares the first half vs. the second half of the conversation
- "improving": Negative emotions decrease AND the joy/neutral shift increases
- "declining": Negative emotions increase AND the joy/neutral shift decreases
- "stable": No significant change
- "neutral": Insufficient data
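The trend rules above can be sketched as a small helper. The halfway split follows the description; the significance threshold is an assumption, since the source does not state one:

```python
def trend_direction(neg_sums, shifts, eps=0.05):
    """Classify the conversation trend by comparing first-half vs.
    second-half averages of the two per-utterance metrics (a sketch;
    eps is an assumed significance threshold)."""
    if len(neg_sums) < 2:
        return "neutral"  # insufficient data
    mid = len(neg_sums) // 2
    avg = lambda xs: sum(xs) / len(xs)
    neg_delta = avg(neg_sums[mid:]) - avg(neg_sums[:mid])
    shift_delta = avg(shifts[mid:]) - avg(shifts[:mid])
    if neg_delta < -eps and shift_delta > eps:
        return "improving"  # negatives fall AND joy/neutral shift rises
    if neg_delta > eps and shift_delta < -eps:
        return "declining"  # negatives rise AND joy/neutral shift falls
    return "stable"
```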
---

## Summary Table

| Evaluator | Granularity | Score Type | Labels/Scale | Evaluates | Model |
|-----------|-------------|------------|--------------|-----------|-------|
| Empathy ER | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | `RyanDDD/empathy-mental-health-reddit-ER` |
| Empathy IP | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | `RyanDDD/empathy-mental-health-reddit-IP` |
| Empathy EX | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | `RyanDDD/empathy-mental-health-reddit-EX` |
| Talk Type | Utterance | Categorical | 3 labels (Change/Neutral/Sustain) | Patient | `RyanDDD/bert-motivational-interviewing` |
| Factuality | Utterance | Numerical | 5-point scale (1-5) | Therapist | OpenAI GPT-4o |
| Emotion Analysis | Utterance + Overall | Numerical | 0.0-1.0 (with labels) | Patient | `j-hartmann/emotion-english-distilroberta-base` |

---
## Usage

All evaluators follow the same interface:

```python
from evaluators.impl.empathy_er_evaluator import EmpathyEREvaluator

# Initialize the evaluator
evaluator = EmpathyEREvaluator()

# Evaluate a conversation
result = evaluator.execute(conversation)

# Access the results
for utterance_result in result["per_utterance"]:
    metrics = utterance_result["metrics"]
    if "empathy_er" in metrics:
        score = metrics["empathy_er"]
        print(f"Label: {score['label']}, Confidence: {score['confidence']}")
```
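Since every evaluator emits the same `per_utterance` shape, results from several evaluators can be merged by utterance index. The merge helper below is hypothetical (not part of the package) and operates purely on the documented result format:

```python
def merge_per_utterance(results):
    """Combine per-utterance metrics from multiple EvaluationResult
    dicts into one dict keyed by utterance index."""
    merged = {}
    for result in results:
        for entry in result.get("per_utterance", []):
            merged.setdefault(entry["index"], {}).update(entry["metrics"])
    return merged
```

This works because metric names (`empathy_er`, `talk_type`, ...) are unique across evaluators, so updating the per-index dict never overwrites another evaluator's scores.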
---

## Adding New Evaluators

To add a new evaluator:

1. Create a new file in the `impl/` directory
2. Inherit from the `Evaluator` base class
3. Register it with the `@register_evaluator` decorator
4. Implement an `execute()` method that returns an `EvaluationResult`
5. Use the helper functions from `utils.evaluation_helpers`:
   - `create_categorical_score()` for categorical scores
   - `create_numerical_score()` for numerical scores
   - `create_utterance_result()` for utterance-level results
   - `create_conversation_result()` for conversation-level results
   - `create_segment_result()` for segment-level results

Example:

```python
from typing import List

from evaluators.base import Evaluator
from evaluators.registry import register_evaluator
from custom_types import Utterance, EvaluationResult
from utils.evaluation_helpers import create_categorical_score, create_utterance_result


@register_evaluator(
    "my_metric",
    label="My Metric",
    description="Description of what this metric measures",
    category="Category Name"
)
class MyEvaluator(Evaluator):
    METRIC_NAME = "my_metric"

    def execute(self, conversation: List[Utterance], **kwargs) -> EvaluationResult:
        # Implementation goes here
        pass
```