# Evaluators Documentation
This directory contains all evaluator implementations for the Therapist Conversation Evaluator tool. Each evaluator measures a different aspect of therapeutic conversations.
## Overview
All evaluators follow a consistent interface and return standardized `EvaluationResult` objects. The result type depends on the granularity level:
- **Utterance-level**: Per-utterance scores (`granularity="utterance"`)
- **Segment-level**: Multi-utterance segment scores (`granularity="segment"`)
- **Conversation-level**: Overall conversation scores (`granularity="conversation"`)
## Evaluation Result Types
### Score Types
1. **Categorical Score**: Discrete labels (e.g., "High", "Medium", "Low")
```python
{
    "type": "categorical",
    "label": "High",
    "confidence": 0.85  # Optional: 0-1 confidence score
}
```
2. **Numerical Score**: Continuous values with an upper bound (e.g., a 1-5 scale)
```python
{
    "type": "numerical",
    "value": 4.0,
    "max_value": 5.0,
    "label": "High"  # Optional: derived label
}
```
### Granularity Levels
- **Utterance-level**: Scores for each individual utterance in the conversation
- **Segment-level**: Aggregate scores for multi-utterance segments
- **Conversation-level**: Overall scores for the entire conversation
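A sketch of the three result envelopes. The `per_utterance` and `overall` keys match the evaluator examples later in this README; the segment-level key name is an assumption, since no segment-level evaluator is documented here:
```python
# Result envelopes by granularity (the "segment" key name is assumed).
utterance_result = {"granularity": "utterance", "per_utterance": [...]}
segment_result = {"granularity": "segment", "per_segment": [...]}  # assumed key
conversation_result = {"granularity": "conversation", "overall": {...}}
```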
---
## Evaluators
### 1. Empathy ER (Emotional Reaction) Evaluator
**File**: `empathy_er_evaluator.py`
**Description**: Measures the emotional reaction component of empathy in therapeutic responses. Evaluates how well the therapist responds to the patient's emotional state.
**Model**: `RyanDDD/empathy-mental-health-reddit-ER`
**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Low", "Medium", "High"]`
- **Evaluates**: Therapist responses only
**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_er": {
                    "type": "categorical",
                    "label": "High",
                    "confidence": 0.92
                }
            }
        }
    ]
}
```
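The three empathy checkpoints are standard Hugging Face text-classification models, so they can be loaded with the `transformers` pipeline API. A minimal sketch; the mapping from the checkpoint's raw output labels to `Low`/`Medium`/`High` is an assumption, not confirmed by this document:
```python
from transformers import pipeline

# Load the ER checkpoint; the IP and EX evaluators follow the same pattern.
classifier = pipeline(
    "text-classification",
    model="RyanDDD/empathy-mental-health-reddit-ER",
)

prediction = classifier("That sounds really overwhelming; I'm glad you told me.")[0]
# prediction looks like {"label": ..., "score": ...}; how the raw labels map
# to Low/Medium/High depends on the checkpoint's config (assumption).
print(prediction["label"], prediction["score"])
```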
---
### 2. Empathy IP (Interpretation) Evaluator
**File**: `empathy_ip_evaluator.py`
**Description**: Measures the interpretation component of empathy in therapeutic responses. Evaluates how well the therapist interprets and understands the patient's situation.
**Model**: `RyanDDD/empathy-mental-health-reddit-IP`
**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Low", "Medium", "High"]`
- **Evaluates**: Therapist responses only
**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_ip": {
                    "type": "categorical",
                    "label": "Medium",
                    "confidence": 0.78
                }
            }
        }
    ]
}
```
---
### 3. Empathy EX (Exploration) Evaluator
**File**: `empathy_ex_evaluator.py`
**Description**: Measures the exploration component of empathy in therapeutic responses. Evaluates how well the therapist explores and deepens understanding of the patient's concerns.
**Model**: `RyanDDD/empathy-mental-health-reddit-EX`
**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Low", "Medium", "High"]`
- **Evaluates**: Therapist responses only
**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "empathy_ex": {
                    "type": "categorical",
                    "label": "High",
                    "confidence": 0.89
                }
            }
        }
    ]
}
```
---
### 4. Talk Type Evaluator
**File**: `talk_type_evaluator.py`
**Description**: Classifies patient utterances as change talk, sustain talk, or neutral, using a BERT model trained on motivational interviewing data.
**Model**: `RyanDDD/bert-motivational-interviewing`
**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Categorical (3 labels)
- **Labels**: `["Change", "Neutral", "Sustain"]`
- **Evaluates**: Patient utterances only
**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "talk_type": {
                    "type": "categorical",
                    "label": "Change",
                    "confidence": 0.91
                }
            }
        }
    ]
}
```
---
### 5. Mental Health Factuality Evaluator
**File**: `factuality_evaluator.py`
**Description**: LLM-as-judge scoring of assistant responses for clinical accuracy, safety, scope appropriateness, evidence-based practice, and overall quality. Uses a strict rubric to evaluate mental health chat responses.
**Model**: OpenAI GPT-4o (configurable)
**Result Type**:
- **Granularity**: Utterance-level
- **Score Type**: Numerical (5-point scale, 1-5 for all dimensions)
- **Evaluates**: Assistant/therapist responses only
**Dimensions**:
1. `overall_score` (1-5): Overall factuality and quality
2. `clinical_accuracy` (1-5): Clinical accuracy of information
3. `safety` (1-5): Safety of the response
4. `scope_appropriateness` (1-5): Appropriateness of scope
5. `evidence_based` (1-5): Evidence-based practice alignment
6. `explanation` (text): Reasoning for the scores
In the result dict these appear under the shorter metric keys `overall`, `clinical_accuracy`, `safety`, `scope`, `evidence`, and `explanation`, as shown below.
**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "overall": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "clinical_accuracy": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "safety": {
                    "type": "numerical",
                    "value": 5,
                    "max_value": 5
                },
                "scope": {
                    "type": "numerical",
                    "value": 4,
                    "max_value": 5
                },
                "evidence": {
                    "type": "numerical",
                    "value": 3,
                    "max_value": 5
                },
                "explanation": {
                    "type": "text",
                    "value": "Response demonstrates good clinical accuracy..."
                }
            }
        }
    ]
}
```
**Scoring Rules**:
- Conservative scoring: Prefers 3 (neutral) for generic responses that make no clinical claims
- Safety gate: If safety ≤ 2, `overall_score` is capped at the safety score
- Evidence requirement: Avoids a 5 unless the response references specific evidence-based techniques
- No clinical claims: If the response lacks clinical terms, `clinical_accuracy` and `evidence_based` are capped at 3
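A minimal sketch of how these caps could be applied as post-processing on the judge's raw scores; the function and flag names are illustrative, not the module's actual API:
```python
def apply_scoring_rules(scores: dict, has_clinical_claims: bool) -> dict:
    """Apply the rubric's caps to raw 1-5 judge scores (illustrative names)."""
    adjusted = dict(scores)
    # Safety gate: a safety score <= 2 caps the overall score.
    if adjusted["safety"] <= 2:
        adjusted["overall_score"] = min(adjusted["overall_score"], adjusted["safety"])
    # No clinical claims: cap clinical_accuracy and evidence_based at 3.
    if not has_clinical_claims:
        adjusted["clinical_accuracy"] = min(adjusted["clinical_accuracy"], 3)
        adjusted["evidence_based"] = min(adjusted["evidence_based"], 3)
    return adjusted
```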
---
### 6. Emotion Analysis Evaluator
**File**: `emotion_evaluator.py`
**Description**: Analyzes user emotions with an emotion classification model. Computes a negative emotion sum and a joy/neutral shift for each utterance, and tracks the emotion trend across the conversation.
**Model**: `j-hartmann/emotion-english-distilroberta-base`
**Result Type**:
- **Granularity**: Utterance-level (with overall trend)
- **Score Type**: Numerical (with categorical labels)
- **Evaluates**: User/patient utterances only
**Metrics**:
1. `emotion_sum_negative` (0.0-1.0): Sum of negative emotions (anger, disgust, fear, sadness)
- Labels: "Low" (< 0.2), "Medium" (0.2-0.5), "High" (> 0.5)
2. `emotion_joy_neutral_shift` (-1.0 to 1.0): Difference between joy and neutral emotions
- Labels: "Positive" (> 0.2), "Neutral" (-0.2 to 0.2), "Negative" (< -0.2)
3. `emotion_trend_direction` (overall): Trend analysis across conversation
- Labels: "improving", "declining", "stable", "neutral"
**Emotion Labels**: `["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]`
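A minimal sketch of the two per-utterance metrics, assuming `probs` maps each of the seven emotion labels to the classifier's probability for one utterance (function and variable names are illustrative):
```python
NEGATIVE_EMOTIONS = ("anger", "disgust", "fear", "sadness")

def per_utterance_emotion_metrics(probs: dict) -> dict:
    """Derive the two documented metrics from one utterance's emotion probabilities."""
    sum_negative = sum(probs[e] for e in NEGATIVE_EMOTIONS)  # 0.0-1.0
    joy_neutral_shift = probs["joy"] - probs["neutral"]      # -1.0 to 1.0
    return {
        "emotion_sum_negative": sum_negative,
        "emotion_joy_neutral_shift": joy_neutral_shift,
    }
```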
**Output Format**:
```python
{
    "granularity": "utterance",
    "per_utterance": [
        {
            "index": 0,
            "metrics": {
                "emotion_sum_negative": {
                    "type": "numerical",
                    "value": 0.35,
                    "max_value": 1.0,
                    "label": "Medium"
                },
                "emotion_joy_neutral_shift": {
                    "type": "numerical",
                    "value": -0.15,
                    "max_value": 1.0,
                    "label": "Neutral"
                }
            }
        }
    ],
    "overall": {
        "emotion_avg_sum_negative": {
            "type": "numerical",
            "value": 0.28,
            "max_value": 1.0,
            "label": "Medium"
        },
        "emotion_avg_joy_neutral_shift": {
            "type": "numerical",
            "value": 0.12,
            "max_value": 1.0,
            "label": "Neutral"
        },
        "emotion_trend_direction": {
            "type": "categorical",
            "label": "improving"
        }
    }
}
```
**Trend Analysis**:
- Compares the first half of the conversation with the second half
- "improving": Negative emotions decrease AND joy/neutral shift increases
- "declining": Negative emotions increase AND joy/neutral shift decreases
- "stable": No significant change
- "neutral": Insufficient data
---
## Summary Table
| Evaluator | Granularity | Score Type | Labels/Scale | Evaluates | Model |
|-----------|-------------|-----------|--------------|-----------|-------|
| Empathy ER | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | `RyanDDD/empathy-mental-health-reddit-ER` |
| Empathy IP | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | `RyanDDD/empathy-mental-health-reddit-IP` |
| Empathy EX | Utterance | Categorical | 3 labels (Low/Medium/High) | Therapist | `RyanDDD/empathy-mental-health-reddit-EX` |
| Talk Type | Utterance | Categorical | 3 labels (Change/Neutral/Sustain) | Patient | `RyanDDD/bert-motivational-interviewing` |
| Factuality | Utterance | Numerical | 5-point scale (1-5) | Therapist | OpenAI GPT-4o |
| Emotion Analysis | Utterance + Overall | Numerical | 0.0-1.0 (with labels) | Patient | `j-hartmann/emotion-english-distilroberta-base` |
---
## Usage
All evaluators follow the same interface:
```python
from evaluators.impl.empathy_er_evaluator import EmpathyEREvaluator

# Initialize evaluator
evaluator = EmpathyEREvaluator()

# Evaluate conversation
result = evaluator.execute(conversation)

# Access results
for utterance_result in result["per_utterance"]:
    metrics = utterance_result["metrics"]
    if "empathy_er" in metrics:
        score = metrics["empathy_er"]
        print(f"Label: {score['label']}, Confidence: {score['confidence']}")
```
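Conversation-level metrics, where an evaluator produces them, live under the `overall` key. A sketch using the emotion evaluator's documented output; the class name `EmotionEvaluator` is assumed from the file name `emotion_evaluator.py`:
```python
from evaluators.impl.emotion_evaluator import EmotionEvaluator  # class name assumed

emotion_result = EmotionEvaluator().execute(conversation)
trend = emotion_result["overall"]["emotion_trend_direction"]["label"]
print(f"Emotion trend across the conversation: {trend}")
```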
---
## Adding New Evaluators
To add a new evaluator:
1. Create a new file in the `impl/` directory
2. Inherit from the `Evaluator` base class
3. Register it with the `@register_evaluator` decorator
4. Implement an `execute()` method that returns an `EvaluationResult`
5. Use helper functions from `utils.evaluation_helpers`:
- `create_categorical_score()` for categorical scores
- `create_numerical_score()` for numerical scores
- `create_utterance_result()` for utterance-level results
- `create_conversation_result()` for conversation-level results
- `create_segment_result()` for segment-level results
Example:
```python
from typing import List

from evaluators.base import Evaluator
from evaluators.registry import register_evaluator
from custom_types import Utterance, EvaluationResult
from utils.evaluation_helpers import create_categorical_score, create_utterance_result


@register_evaluator(
    "my_metric",
    label="My Metric",
    description="Description of what this metric measures",
    category="Category Name"
)
class MyEvaluator(Evaluator):
    METRIC_NAME = "my_metric"

    def execute(self, conversation: List[Utterance], **kwargs) -> EvaluationResult:
        # Score each utterance (or the whole conversation) here and package
        # the results with the helper functions imported above.
        pass
```