Evaluators Documentation

This directory contains all evaluator implementations for the Therapist Conversation Evaluator tool. Each evaluator measures a different aspect of therapeutic conversations.

Overview

All evaluators follow a consistent interface and return standardized EvaluationResult objects. The result type depends on the granularity level:

  • Utterance-level: Per-utterance scores (granularity="utterance")
  • Segment-level: Multi-utterance segment scores (granularity="segment")
  • Conversation-level: Overall conversation scores (granularity="conversation")

Evaluation Result Types

Score Types

  1. Categorical Score: Discrete labels (e.g., "High", "Medium", "Low")

    {
        "type": "categorical",
        "label": "High",
        "confidence": 0.85  # Optional: 0-1 confidence score
    }
    
  2. Numerical Score: Continuous values with an explicit maximum bound (e.g., a 1-5 scale)

    {
        "type": "numerical",
        "value": 4.0,
        "max_value": 5.0,
        "label": "High"  # Optional: derived label
    }
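
Both shapes can be produced with the helper functions from utils.evaluation_helpers (documented under "Adding New Evaluators" below). A minimal sketch, assuming the keyword signatures mirror the JSON fields above:

# Hypothetical sketch: keyword names are inferred from the JSON shapes
# above; check the helpers' actual signatures before relying on them.
from utils.evaluation_helpers import create_categorical_score, create_numerical_score

categorical = create_categorical_score(label="High", confidence=0.85)
# -> {"type": "categorical", "label": "High", "confidence": 0.85}

numerical = create_numerical_score(value=4.0, max_value=5.0, label="High")
# -> {"type": "numerical", "value": 4.0, "max_value": 5.0, "label": "High"}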
    

Granularity Levels

  • Utterance-level: Scores for each individual utterance in the conversation
  • Segment-level: Aggregate scores for multi-utterance segments
  • Conversation-level: Overall scores for the entire conversation

Evaluators

1. Empathy ER (Emotional Reaction) Evaluator

File: empathy_er_evaluator.py

Description: Measures the emotional reaction component of empathy in therapeutic responses. Evaluates how well the therapist responds to the patient's emotional state.

Model: RyanDDD/empathy-mental-health-reddit-ER

Result Type:

  • Granularity: Utterance-level
  • Score Type: Categorical (3 labels)
  • Labels: ["Low", "Medium", "High"]
  • Evaluates: Therapist responses only

Output Format:

{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_er": {
                    "type": "categorical",
                    "label": "High",
                    "confidence": 0.92
                }
            }
        }
    ]
}
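
A categorical utterance-level evaluator like this one can be approximated with a standard Hugging Face text-classification pipeline. A minimal sketch, assuming the checkpoint exposes its three labels through the usual classification head (the real evaluator may map raw label ids to "Low"/"Medium"/"High" itself):

from transformers import pipeline

# Illustrative only: score a single therapist turn with the ER checkpoint.
classifier = pipeline(
    "text-classification",
    model="RyanDDD/empathy-mental-health-reddit-ER",
)

pred = classifier("That sounds exhausting. It makes sense that you feel drained.")[0]
print(pred["label"], round(pred["score"], 2))  # e.g. High 0.92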

2. Empathy IP (Interpretation) Evaluator

File: empathy_ip_evaluator.py

Description: Measures the interpretation component of empathy in therapeutic responses. Evaluates how well the therapist interprets and understands the patient's situation.

Model: RyanDDD/empathy-mental-health-reddit-IP

Result Type:

  • Granularity: Utterance-level
  • Score Type: Categorical (3 labels)
  • Labels: ["Low", "Medium", "High"]
  • Evaluates: Therapist responses only

Output Format:

{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_ip": {
                    "type": "categorical",
                    "label": "Medium",
                    "confidence": 0.78
                }
            }
        }
    ]
}

3. Empathy EX (Exploration) Evaluator

File: empathy_ex_evaluator.py

Description: Measures the exploration component of empathy in therapeutic responses. Evaluates how well the therapist explores and deepens understanding of the patient's concerns.

Model: RyanDDD/empathy-mental-health-reddit-EX

Result Type:

  • Granularity: Utterance-level
  • Score Type: Categorical (3 labels)
  • Labels: ["Low", "Medium", "High"]
  • Evaluates: Therapist responses only

Output Format:

{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_ex": {
                    "type": "categorical",
                    "label": "High",
                    "confidence": 0.89
                }
            }
        }
    ]
}

4. Talk Type Evaluator

File: talk_type_evaluator.py

Description: Classifies patient utterances as change talk, sustain talk, or neutral, using a BERT model trained on motivational interviewing data.

Model: RyanDDD/bert-motivational-interviewing

Result Type:

  • Granularity: Utterance-level
  • Score Type: Categorical (3 labels)
  • Labels: ["Change", "Neutral", "Sustain"]
  • Evaluates: Patient utterances only

Output Format:

{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "talk_type": {
                    "type": "categorical",
                    "label": "Change",
                    "confidence": 0.91
                }
            }
        }
    ]
}
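
Because this evaluator scores patient utterances only, the scoring loop filters by speaker while preserving each utterance's index in the full conversation. A minimal sketch, assuming each utterance is a dict with "speaker" and "text" keys (the real Utterance type may differ):

from transformers import pipeline

clf = pipeline("text-classification", model="RyanDDD/bert-motivational-interviewing")

conversation = [
    {"speaker": "therapist", "text": "What brings you in today?"},
    {"speaker": "patient", "text": "I really want to cut back on my drinking."},
]

per_utterance = []
for i, utt in enumerate(conversation):
    if utt["speaker"] != "patient":  # skip therapist turns
        continue
    pred = clf(utt["text"])[0]
    per_utterance.append({
        "index": i,  # index into the full conversation, not the filtered list
        "metrics": {"talk_type": {
            "type": "categorical",
            "label": pred["label"],            # e.g. "Change"
            "confidence": round(pred["score"], 2),
        }},
    })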

5. Mental Health Factuality Evaluator

File: factuality_evaluator.py

Description: LLM-as-Judge scoring of assistant responses for clinical accuracy, safety, scope appropriateness, evidence-based practice, and overall quality. Uses a strict rubric to evaluate mental health chat responses.

Model: OpenAI GPT-4o (configurable)

Result Type:

  • Granularity: Utterance-level
  • Score Type: Numerical (5-point scale: 1-5)
  • Scale: 1-5 for all dimensions
  • Evaluates: Assistant/therapist responses only

Dimensions:

  1. overall_score (1-5): Overall factuality and quality
  2. clinical_accuracy (1-5): Clinical accuracy of information
  3. safety (1-5): Safety of the response
  4. scope_appropriateness (1-5): Appropriateness of scope
  5. evidence_based (1-5): Evidence-based practice alignment
  6. explanation (text): Reasoning for the scores

Output Format:

{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "overall": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "clinical_accuracy": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "safety": {
                    "type": "numerical",
                    "value": 5,
                    "max_value": 5
                },
                "scope": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "evidence": {
                    "type": "numerical",
                    "value": 3,
                    "max_value": 5
                },
                "explanation": {
                    "type": "text",
                    "value": "Response demonstrates good clinical accuracy..."
                }
            }
        }
    ]
}

Scoring Rules:

  • Conservative scoring: Prefers 3 (neutral) for generic responses that make no clinical claims
  • Safety gate: If safety ≤ 2, overall_score is capped at the safety score
  • Evidence requirement: Avoids 5 unless the response references specific evidence-based techniques
  • No clinical claims: If the response lacks clinical terms, clinical_accuracy and evidence_based are capped at 3
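
The caps compose as plain min() operations applied after the judge returns its raw scores. A minimal sketch of the rules above, assuming the scores arrive as a dict of 1-5 integers (the function name and dict shape are hypothetical):

def apply_scoring_rules(scores: dict, clinical_terms_present: bool) -> dict:
    capped = dict(scores)
    # Safety gate: a low safety score caps the overall score.
    if capped["safety"] <= 2:
        capped["overall_score"] = min(capped["overall_score"], capped["safety"])
    # Without explicit clinical claims, accuracy and evidence cannot exceed 3.
    if not clinical_terms_present:
        capped["clinical_accuracy"] = min(capped["clinical_accuracy"], 3)
        capped["evidence_based"] = min(capped["evidence_based"], 3)
    return capped

print(apply_scoring_rules(
    {"overall_score": 5, "safety": 2, "clinical_accuracy": 4, "evidence_based": 4},
    clinical_terms_present=False,
))
# {'overall_score': 2, 'safety': 2, 'clinical_accuracy': 3, 'evidence_based': 3}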

6. Emotion Analysis Evaluator

File: emotion_evaluator.py

Description: Analyzes user emotions using an emotion classification model. Calculates the negative emotion sum and the joy/neutral shift, and tracks emotion change trends across the conversation.

Model: j-hartmann/emotion-english-distilroberta-base

Result Type:

  • Granularity: Utterance-level (with overall trend)
  • Score Type: Numerical (with categorical labels)
  • Evaluates: User/patient utterances only

Metrics:

  1. emotion_sum_negative (0.0-1.0): Sum of negative emotions (anger, disgust, fear, sadness)
    • Labels: "Low" (< 0.2), "Medium" (0.2-0.5), "High" (> 0.5)
  2. emotion_joy_neutral_shift (-1.0 to 1.0): Difference between joy and neutral emotions
    • Labels: "Positive" (> 0.2), "Neutral" (-0.2 to 0.2), "Negative" (< -0.2)
  3. emotion_trend_direction (overall): Trend analysis across conversation
    • Labels: "improving", "declining", "stable", "neutral"

Emotion Labels: ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
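
Both per-utterance metrics are simple arithmetic over the model's softmax output. A minimal sketch, assuming the probabilities have already been collected into a label-to-score dict:

NEGATIVE_EMOTIONS = ("anger", "disgust", "fear", "sadness")

def emotion_metrics(probs: dict) -> dict:
    # Sum of the four negative emotions, and joy minus neutral.
    return {
        "emotion_sum_negative": round(sum(probs[e] for e in NEGATIVE_EMOTIONS), 3),
        "emotion_joy_neutral_shift": round(probs["joy"] - probs["neutral"], 3),
    }

probs = {"anger": 0.05, "disgust": 0.02, "fear": 0.18, "sadness": 0.10,
         "joy": 0.15, "neutral": 0.30, "surprise": 0.20}
print(emotion_metrics(probs))
# {'emotion_sum_negative': 0.35, 'emotion_joy_neutral_shift': -0.15}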

Output Format:

{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "emotion_sum_negative": {
                    "type": "numerical",
                    "value": 0.35,
                    "max_value": 1.0,
                    "label": "Medium"
                },
                "emotion_joy_neutral_shift": {
                    "type": "numerical",
                    "value": -0.15,
                    "max_value": 1.0,
                    "label": "Neutral"
                }
            }
        }
    ],
    "overall": {
        "emotion_avg_sum_negative": {
            "type": "numerical",
            "value": 0.28,
            "max_value": 1.0,
            "label": "improving"
        },
        "emotion_avg_joy_neutral_shift": {
            "type": "numerical",
            "value": 0.12,
            "max_value": 1.0,
            "label": "improving"
        },
        "emotion_trend_direction": {
            "type": "categorical",
            "label": "improving"
        }
    }
}

Trend Analysis:

  • Compares first half vs second half of conversation
  • "improving": Negative emotions decrease AND joy/neutral shift increases
  • "declining": Negative emotions increase AND joy/neutral shift decreases
  • "stable": No significant change
  • "neutral": Insufficient data

Summary Table

| Evaluator | Granularity | Score Type | Labels/Scale | Evaluates | Model |
|---|---|---|---|---|---|
| Empathy ER | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | RyanDDD/empathy-mental-health-reddit-ER |
| Empathy IP | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | RyanDDD/empathy-mental-health-reddit-IP |
| Empathy EX | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | RyanDDD/empathy-mental-health-reddit-EX |
| Talk Type | Utterance | Categorical | 3 labels (Change/Neutral/Sustain) | Patient | RyanDDD/bert-motivational-interviewing |
| Factuality | Utterance | Numerical | 5-point scale (1-5) | Therapist | OpenAI GPT-4o |
| Emotion Analysis | Utterance + Overall | Numerical | 0.0-1.0 (with labels) | Patient | j-hartmann/emotion-english-distilroberta-base |

Usage

All evaluators follow the same interface:

from evaluators.impl.empathy_er_evaluator import EmpathyEREvaluator

# Initialize evaluator
evaluator = EmpathyEREvaluator()

# Evaluate conversation
result = evaluator.execute(conversation)

# Access results
for utterance_result in result["per_utterance"]:
    metrics = utterance_result["metrics"]
    if "empathy_er" in metrics:
        score = metrics["empathy_er"]
        print(f"Label: {score['label']}, Confidence: {score['confidence']}")

Adding New Evaluators

To add a new evaluator:

  1. Create a new file in the impl/ directory
  2. Inherit from the Evaluator base class
  3. Register it with the @register_evaluator decorator
  4. Implement an execute() method that returns an EvaluationResult
  5. Use the helper functions from utils.evaluation_helpers:
    • create_categorical_score() for categorical scores
    • create_numerical_score() for numerical scores
    • create_utterance_result() for utterance-level results
    • create_conversation_result() for conversation-level results
    • create_segment_result() for segment-level results

Example:

from typing import List

from evaluators.base import Evaluator
from evaluators.registry import register_evaluator
from custom_types import Utterance, EvaluationResult
from utils.evaluation_helpers import create_categorical_score, create_utterance_result

@register_evaluator(
    "my_metric",
    label="My Metric",
    description="Description of what this metric measures",
    category="Category Name"
)
class MyEvaluator(Evaluator):
    METRIC_NAME = "my_metric"

    def execute(self, conversation: List[Utterance], **kwargs) -> EvaluationResult:
        # Score every utterance and wrap the scores in the documented
        # utterance-level shape. The helper calls below are illustrative;
        # check their actual signatures in utils.evaluation_helpers.
        per_utterance = []
        for i, utterance in enumerate(conversation):
            score = create_categorical_score(label="High", confidence=0.9)
            per_utterance.append({"index": i, "metrics": {self.METRIC_NAME: score}})
        return create_utterance_result(per_utterance)