Space: taboola-cz/sel-chat-coach

tblaisaacliao committed on Dec 26, 2025

Commit

dca7537

1 Parent(s): 710cde6

refine evaluation and fix CSV download problem

Files changed (6)

docs/backend-doc/14-evaluation-system.md +413 -0
src/app/admin/conversations/[conversationId]/page.tsx +2 -2
src/app/admin/conversations/page.tsx +26 -8
src/app/admin/evaluations/[id]/page.tsx +2 -2
src/app/api/conversations/create/route.ts +15 -1
src/lib/services/evaluation-service.ts +16 -6

docs/backend-doc/14-evaluation-system.md ADDED

	@@ -0,0 +1,413 @@

+# Evaluation System
+This document explains how the AI-based conversation evaluation system works.
+## Overview
+The evaluation system uses AI (OpenAI) to assess conversation quality for teacher training purposes. It supports two evaluation modes:
+- **Student conversations**: Evaluates how well the AI student simulation helps train teachers
+- **Coach-direct conversations**: Evaluates direct teacher-coach interactions (no student)
+## Key Files
+| File | Purpose |
+|------|---------|
+| `src/lib/services/evaluation-service.ts` | Core evaluation logic and AI integration |
+| `src/lib/repositories/evaluation-repository.ts` | Database operations |
+| `src/lib/types/models.ts` | TypeScript types (lines 83-150) |
+| `src/app/api/admin/evaluations/` | API endpoints |
+| `src/app/admin/evaluations/` | Admin UI pages |
+---
+## API Endpoints
+### List Evaluations
+```
+GET /api/admin/evaluations
+```
+**Query Parameters:**
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `page` | number | 1 | Page number |
+| `limit` | number | 50 | Results per page |
+| `sortBy` | string | `evaluatedAt` | Sort by `evaluatedAt` or `overallScore` |
+| `sortOrder` | string | `desc` | `asc` or `desc` |
+| `evaluationType` | string | - | Filter by type |
+| `studentPromptId` | string | - | Filter by student personality |
+| `minScore` | number | - | Minimum score filter |
+| `maxScore` | number | - | Maximum score filter |
+| `startDate` | string | - | Start date filter (ISO) |
+| `endDate` | string | - | End date filter (ISO) |
+**Response:**
+```json
+{
+  "evaluations": [...],
+  "pagination": {
+    "page": 1,
+    "limit": 50,
+    "total": 100,
+    "totalPages": 2
+  }
+}
+```
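As an illustration of how a client might call this endpoint, the sketch below builds the query string from the parameters in the table above. `buildEvaluationsUrl` and `totalPages` are hypothetical helpers, not part of the codebase, and the `ceil(total / limit)` formula is inferred from the sample response rather than confirmed by the source.

```typescript
// Hypothetical client-side helper for GET /api/admin/evaluations.
// Parameter names come from the table above; everything else is illustrative.
interface ListEvaluationsQuery {
  page?: number;
  limit?: number;
  sortBy?: "evaluatedAt" | "overallScore";
  sortOrder?: "asc" | "desc";
  evaluationType?: string;
  studentPromptId?: string;
  minScore?: number;
  maxScore?: number;
  startDate?: string; // ISO date string
  endDate?: string; // ISO date string
}

function buildEvaluationsUrl(query: ListEvaluationsQuery = {}): string {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(query)) {
    // Leave unset filters out so the server can apply its documented defaults.
    if (value !== undefined) params.set(key, String(value));
  }
  const qs = params.toString();
  return qs ? `/api/admin/evaluations?${qs}` : "/api/admin/evaluations";
}

// Assumption: totalPages in the response is ceil(total / limit).
function totalPages(total: number, limit: number): number {
  return Math.ceil(total / limit);
}
```

Calling `buildEvaluationsUrl()` with no arguments returns the bare path, which relies on the server-side defaults listed in the table.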
+### Trigger Batch Evaluation
+```
+POST /api/admin/evaluations
+```
+**Request Body:**
+```json
+{
+  "conversationIds": ["uuid-1", "uuid-2", "uuid-3"]
+}
+```
+**Constraints:**
+- Maximum 10 conversations per batch
+- Array must be non-empty
+**Response:**
+```json
+{
+  "successful": [...],
+  "failed": [{ "conversationId": "...", "error": "..." }],
+  "summary": {
+    "total": 3,
+    "successful": 2,
+    "failed": 1
+  }
+}
+```
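The batch constraints above can be checked before a request is sent. The following is a minimal sketch under the assumption that a caller wants to validate and, if needed, split a longer ID list on the client side; `validateBatchRequest`, `chunkIds`, and the error messages are hypothetical and do not reflect the route's actual implementation.

```typescript
// Limit taken from the "Constraints" list; all names here are illustrative.
const MAX_BATCH_SIZE = 10;

function validateBatchRequest(body: unknown): string[] {
  const ids = (body as { conversationIds?: unknown } | null)?.conversationIds;
  if (!Array.isArray(ids) || ids.length === 0) {
    throw new Error("conversationIds must be a non-empty array");
  }
  if (ids.length > MAX_BATCH_SIZE) {
    throw new Error(`At most ${MAX_BATCH_SIZE} conversations per batch`);
  }
  if (!ids.every((id): id is string => typeof id === "string")) {
    throw new Error("conversationIds must contain only strings");
  }
  return ids;
}

// Splits a longer list into chunks that each satisfy the batch limit.
function chunkIds(ids: string[], size: number = MAX_BATCH_SIZE): string[][] {
  const chunks: string[][] = [];
  for (let i = 0; i < ids.length; i += size) {
    chunks.push(ids.slice(i, i + size));
  }
  return chunks;
}
```

Each chunk returned by `chunkIds` could then be posted as its own `conversationIds` array.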
+### Get Single Evaluation
+```
+GET /api/admin/evaluations/[id]
+```
+### Get Evaluation by Conversation
+```
+GET /api/admin/evaluations/conversation/[conversationId]
+```
+### Trigger Single Evaluation
+```
+POST /api/admin/evaluations/conversation/[conversationId]
+```
+**Query Parameters:**
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `force` | boolean | false | Force re-evaluation (deletes existing) |
+### Get Statistics
+```
+GET /api/admin/evaluations/stats
+GET /api/admin/evaluations/stats?studentPromptId=grade_1
+```
+---
+## Evaluation Flow
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    API Request Received                          │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  1. Load Data                                                    │
+│     - Fetch conversation from ConversationRepository             │
+│     - Fetch messages from MessageRepository                      │
+│     - Check if evaluation already exists                         │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  2. Detect Conversation Type                                     │
+│     - If studentPromptId === 'coach_direct' → coach_direct mode  │
+│     - Otherwise → student mode                                   │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  3. Format Conversation for AI                                   │
+│     - Filter out system messages                                 │
+│     - Map roles to Chinese labels:                               │
+│       • user → "老師" (Teacher)                                  │
+│       • assistant → "學生"/"教練" based on speaker field         │
+│     - Coach-direct: only "老師" and "教練"                       │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  4. Call AI Model (OpenAI)                                       │
+│     - Model: gpt-4o-mini (or MODEL_NAME env var)                 │
+│     - Temperature: 0.3 (for consistency)                         │
+│     - System prompt: based on conversation type                  │
+│     - User prompt: formatted conversation + system prompt        │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  5. Parse AI Response                                            │
+│     - Extract JSON from response                                 │
+│     - Validate required fields                                   │
+│     - Build EvaluationScores and EvaluationFeedback objects      │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  6. Save to Database                                             │
+│     - Generate UUID                                              │
+│     - Serialize scores/feedback to JSON                          │
+│     - Insert into evaluations table                              │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                    Return Evaluation Object                      │
+└─────────────────────────────────────────────────────────────────┘
+```
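Steps 2 and 3 of the flow above can be sketched as follows. The `ChatMessage` shape, the function names, and the `label: content` line format are assumptions made for illustration; only the role-to-label mapping and the `coach_direct` check come from this document.

```typescript
// Assumed message shape; the real types live in src/lib/types/models.ts.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  speaker?: "student" | "coach";
  content: string;
}

// Maps a message to the Chinese label used in the evaluation transcript.
function labelFor(msg: ChatMessage, coachDirect: boolean): string {
  if (msg.role === "user") return "老師"; // Teacher
  if (coachDirect || msg.speaker === "coach") return "教練"; // Coach
  return "學生"; // Student
}

function formatConversation(messages: ChatMessage[], studentPromptId: string): string {
  // Step 2: coach-direct mode is detected from the prompt ID.
  const coachDirect = studentPromptId === "coach_direct";
  // Step 3: drop system messages, then label each remaining message.
  return messages
    .filter((m) => m.role !== "system")
    .map((m) => `${labelFor(m, coachDirect)}: ${m.content}`)
    .join("\n");
}
```

In coach-direct mode every assistant message is labeled "教練", regardless of its `speaker` value, which matches the "only 老師 and 教練" rule in step 3.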
+---
+## Response Schema (Critical)
+When modifying evaluation prompts, the AI response **must** follow this exact JSON structure. The `parseEvaluationResponse()` function validates these fields.
+### Required JSON Structure
+```json
+{
+  "teacherEngagement": {
+    "level": "high|medium|low|none",
+    "warning": "string (empty if level is high/medium)"
+  },
+  "promptDesign": {
+    "clarity": 1-5,
+    "completeness": 1-5,
+    "specificity": 1-5,
+    "consistency": 1-5,
+    "overall": 1-5,
+    "rationale": "string"
+  },
+  "trainingEffectiveness": {
+    "challengeLevel": 1-5,
+    "learningOpportunities": 1-5,
+    "realisticScenarios": 1-5,
+    "engagementDepth": 1-5,
+    "overall": 1-5,
+    "rationale": "string"
+  },
+  "conversationQuality": {
+    "teacherInsights": 1-5,
+    "interactionDepth": 1-5,
+    "educationalValue": 1-5,
+    "overall": 1-5,
+    "rationale": "string"
+  },
+  "overallScore": 1-5,
+  "strengths": ["string array"],
+  "improvementAreas": ["string array"],
+  "promptSuggestions": ["string array"]
+}
+```
+### Validation Rules
+The parser checks:
+- `promptDesign`, `trainingEffectiveness`, `conversationQuality` objects must exist
+- `overallScore` must be a number
+- Missing optional fields (`teacherEngagement` and the `strengths`, `improvementAreas`, and `promptSuggestions` arrays) default to empty values
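A validator that follows these rules might look like the sketch below. It is not the project's `parseEvaluationResponse()`; the function names, the brace-based JSON extraction, and the empty-string defaults for `teacherEngagement` are all assumptions.

```typescript
// Hypothetical validator mirroring the rules listed above.
type Section = Record<string, unknown>;

interface ParsedEvaluation {
  teacherEngagement: { level: string; warning: string };
  promptDesign: Section;
  trainingEffectiveness: Section;
  conversationQuality: Section;
  overallScore: number;
  strengths: string[];
  improvementAreas: string[];
  promptSuggestions: string[];
}

const REQUIRED_SECTIONS = ["promptDesign", "trainingEffectiveness", "conversationQuality"] as const;

// Takes the span from the first "{" to the last "}" so stray text around the JSON is tolerated.
function extractJson(raw: string): string {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found in AI response");
  return raw.slice(start, end + 1);
}

function isObject(v: unknown): v is Section {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function stringList(v: unknown): string[] {
  return Array.isArray(v) ? v.filter((x): x is string => typeof x === "string") : [];
}

function parseEvaluation(raw: string): ParsedEvaluation {
  const data: unknown = JSON.parse(extractJson(raw));
  if (!isObject(data)) throw new Error("AI response is not a JSON object");
  for (const key of REQUIRED_SECTIONS) {
    if (!isObject(data[key])) throw new Error(`Missing required object: ${key}`);
  }
  const overallScore = data["overallScore"];
  if (typeof overallScore !== "number") throw new Error("overallScore must be a number");
  // Optional fields fall back to empty values (assumed: empty strings and empty arrays).
  const rawEngagement = data["teacherEngagement"];
  const engagement: Section = isObject(rawEngagement) ? rawEngagement : {};
  const level = engagement["level"];
  const warning = engagement["warning"];
  return {
    teacherEngagement: {
      level: typeof level === "string" ? level : "",
      warning: typeof warning === "string" ? warning : "",
    },
    promptDesign: data["promptDesign"] as Section,
    trainingEffectiveness: data["trainingEffectiveness"] as Section,
    conversationQuality: data["conversationQuality"] as Section,
    overallScore,
    strengths: stringList(data["strengths"]),
    improvementAreas: stringList(data["improvementAreas"]),
    promptSuggestions: stringList(data["promptSuggestions"]),
  };
}
```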
+---
+## Modifying Evaluation Prompts
+### Prompt Locations
+| Prompt | Variable | Purpose |
+|--------|----------|---------|
+| Student evaluation | `EVALUATION_SYSTEM_PROMPT` | Evaluates student conversations |
+| Coach-direct evaluation | `COACH_DIRECT_EVALUATION_SYSTEM_PROMPT` | Evaluates coach-only conversations |
+File: `src/lib/services/evaluation-service.ts` (lines 21-200)
+### What You CAN Change
+- Evaluation criteria descriptions
+- Score weighting explanations
+- Chinese text and examples
+- `teacherEngagement.level` thresholds
+### What You MUST Keep
+- The JSON structure defined in the "Response Schema" section above
+- Field names (e.g., `promptDesign.clarity`, `conversationQuality.overall`)
+- Score range (1-5)
+- The instruction `只返回有效的 JSON，不要其他文字` ("Return only valid JSON, no other text"), kept verbatim in the prompt
+### Feedback Field Audience
+`improvementAreas` and `promptSuggestions` are for **prompt engineers**, not teachers:
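+- ❌ Wrong: "老師可以多用開放式問句" ("The teacher could use more open-ended questions")
+- ✅ Correct: "系統提示應增加學生對教師特定技巧的回應指示" ("The system prompt should add instructions for how the student responds to specific teacher techniques")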
+---
+## Conversation Type Handling
+### Student Conversations
+**Condition:** `studentPromptId !== 'coach_direct'`
+**Evaluation Focus:**
+- Prompt design quality (weight: 20%)
+- Training effectiveness (weight: 20%)
+- Teacher experience (weight: 60% - highest priority)
+**Message Labeling:**
+- User → "老師" (Teacher)
+- Assistant with `speaker === 'student'` → "學生" (Student)
+- Assistant with `speaker === 'coach'` → "教練" (Coach)
+**Overall Score Calculation:**
+```
+overallScore = 0.2 × promptDesign.overall
+             + 0.2 × trainingEffectiveness.overall
+             + 0.6 × conversationQuality.overall
+```
+### Coach-Direct Conversations
+**Condition:** `studentPromptId === 'coach_direct'`
+**Evaluation Focus:**
+- Coach guidance quality (weight: 50%)
+- Teacher learning effectiveness (weight: 50%)
+**Message Labeling:**
+- User → "老師" (Teacher)
+- Assistant → "教練" (Coach)
+**Overall Score Calculation:**
+```
+overallScore = 0.5 × promptDesign.overall
+             + 0.5 × trainingEffectiveness.overall
+```
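According to the prompts, the model reports `overallScore` itself, so the two formulas above are mainly useful for recomputing the value as a sanity check. The sketch below does that under the documented weights; `WEIGHTS` and `weightedOverall` are hypothetical names, not functions from `evaluation-service.ts`.

```typescript
type Mode = "student" | "coach_direct";

interface SectionOveralls {
  promptDesign: number;
  trainingEffectiveness: number;
  conversationQuality?: number; // not weighted in coach-direct mode
}

// Weights copied from the formulas above.
const WEIGHTS: Record<Mode, Required<SectionOveralls>> = {
  student: { promptDesign: 0.2, trainingEffectiveness: 0.2, conversationQuality: 0.6 },
  coach_direct: { promptDesign: 0.5, trainingEffectiveness: 0.5, conversationQuality: 0 },
};

function weightedOverall(mode: Mode, s: SectionOveralls): number {
  const w = WEIGHTS[mode];
  return (
    w.promptDesign * s.promptDesign +
    w.trainingEffectiveness * s.trainingEffectiveness +
    w.conversationQuality * (s.conversationQuality ?? 0)
  );
}
```

Because the result is a floating-point sum, comparing it with the model-reported `overallScore` is best done with a small tolerance rather than strict equality.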
+---
+## Scoring System
+All scores range from 1 to 5 (1 = Poor, 5 = Excellent). See the "Response Schema" section for field names.
+### Score Weights
+| Mode | promptDesign | trainingEffectiveness | conversationQuality |
+|------|--------------|----------------------|---------------------|
+| Student | 20% | 20% | 60% |
+| Coach-direct | 50% | 50% | (included in feedback) |
+### Score Color Coding (UI)
+| Score Range | Color | Meaning |
+|-------------|-------|---------|
+| ≥ 4.0 | Green | Excellent |
+| 3.0 to < 4.0 | Yellow | Good |
+| < 3.0 | Red | Needs Improvement |
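The color thresholds in the table can be expressed as a small lookup. This is an illustrative sketch rather than the UI's actual code; `scoreColor` is a hypothetical name, and it assumes that any score at or above 3.0 but below 4.0 is shown as yellow.

```typescript
type ScoreColor = "green" | "yellow" | "red";

// Maps an overall score to the color bucket from the table above.
function scoreColor(score: number): ScoreColor {
  if (score >= 4.0) return "green"; // Excellent
  if (score >= 3.0) return "yellow"; // Good
  return "red"; // Needs Improvement
}
```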
+---
+## Feedback Structure
+See the "Response Schema" section for the full field structure.
+### Teacher Engagement Levels
+| Level | Description |
+|-------|-------------|
+| `high` | Actively engaged, meaningful questions |
+| `medium` | Participated but shallow interaction |
+| `low` | Low participation, brief responses |
+| `none` | Meaningless input (random numbers, gibberish) |
+**Low scores are triggered by:**
+- Random numbers or gibberish input
+- Conversations that are too short
+- A teacher who is clearly not taking the conversation seriously
+---
+## Database Schema
+### evaluations Table
+```sql
+CREATE TABLE IF NOT EXISTS evaluations (
+  id TEXT PRIMARY KEY,
+  conversation_id TEXT NOT NULL,
+  student_prompt_id TEXT,
+  evaluation_type TEXT NOT NULL,
+  model_used TEXT NOT NULL,
+  evaluated_at TEXT NOT NULL,
+  evaluated_by TEXT,
+  overall_score REAL,
+  scores TEXT NOT NULL,        -- JSON string
+  feedback TEXT NOT NULL,      -- JSON string
+  raw_response TEXT,
+  created_at TEXT NOT NULL
+);
+```
+### Indexes
+```sql
+CREATE INDEX idx_evaluations_conversation ON evaluations(conversation_id);
+CREATE INDEX idx_evaluations_type ON evaluations(evaluation_type);
+CREATE INDEX idx_evaluations_prompt ON evaluations(student_prompt_id);
+CREATE INDEX idx_evaluations_score ON evaluations(overall_score);
+CREATE INDEX idx_evaluations_date ON evaluations(evaluated_at);
+```
+---
+## TypeScript Types
+See `src/lib/types/models.ts` (lines 83-150) for full type definitions.
+### Evaluation (main object)
+```typescript
+interface Evaluation {
+  id: string;
+  conversationId: string;
+  studentPromptId?: string;
+  evaluationType: 'conversation_quality';
+  evaluationMode?: 'student' | 'coach_direct';  // Derived from studentPromptId
+  modelUsed: string;
+  evaluatedAt: string;
+  evaluatedBy?: string;
+  overallScore?: number;
+  scores: EvaluationScores;      // See "Response Schema" section
+  feedback: EvaluationFeedback;  // See "Response Schema" section
+  rawResponse?: string;
+  createdAt: string;
+}
+```
+---
+## Notes
+- `evaluationMode` is **not stored** in the database - it's derived from `studentPromptId` at read time
+- Raw AI response is stored for debugging and auditing purposes
+- Existing evaluations are skipped unless `force=true` is passed
+- Batch evaluations continue past individual failures; they do not stop at the first error
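The first note, that `evaluationMode` is derived at read time, can be sketched as a row-to-object mapping. The `EvaluationRow` shape follows the columns of the `evaluations` table, but the interface and both function names are assumptions and may differ from what `evaluation-repository.ts` does.

```typescript
// Assumed row shape, following the columns of the evaluations table.
interface EvaluationRow {
  id: string;
  conversation_id: string;
  student_prompt_id: string | null;
  evaluation_type: string;
  model_used: string;
  evaluated_at: string;
  evaluated_by: string | null;
  overall_score: number | null;
  scores: string; // JSON string
  feedback: string; // JSON string
  raw_response: string | null;
  created_at: string;
}

type EvaluationMode = "student" | "coach_direct";

// Not stored in the database: computed from studentPromptId on every read.
function deriveEvaluationMode(studentPromptId: string | null): EvaluationMode {
  return studentPromptId === "coach_direct" ? "coach_direct" : "student";
}

function rowToEvaluation(row: EvaluationRow) {
  return {
    id: row.id,
    conversationId: row.conversation_id,
    studentPromptId: row.student_prompt_id ?? undefined,
    evaluationType: row.evaluation_type,
    evaluationMode: deriveEvaluationMode(row.student_prompt_id),
    modelUsed: row.model_used,
    evaluatedAt: row.evaluated_at,
    evaluatedBy: row.evaluated_by ?? undefined,
    overallScore: row.overall_score ?? undefined,
    // scores and feedback are stored as serialized JSON and parsed on read.
    scores: JSON.parse(row.scores) as Record<string, unknown>,
    feedback: JSON.parse(row.feedback) as Record<string, unknown>,
    rawResponse: row.raw_response ?? undefined,
    createdAt: row.created_at,
  };
}
```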

src/app/admin/conversations/[conversationId]/page.tsx CHANGED

@@ -416,7 +416,7 @@ export default function AdminConversationDetailPage() {
                   </div>
                   <div className="p-4 bg-yellow-50 rounded-lg border border-yellow-200">
-                    <h4 className="text-sm font-semibold text-yellow-800 mb-2">Prompt Improvements</h4>
                     {evaluation.feedback.improvementAreas.length > 0 ? (
                       <ul className="text-sm text-yellow-700 space-y-1">
                         {evaluation.feedback.improvementAreas.map((a, i) => (
@@ -435,7 +435,7 @@ export default function AdminConversationDetailPage() {
                 {/* Prompt Suggestions */}
                 {evaluation.feedback.promptSuggestions.length > 0 && (
                   <div className="p-4 bg-blue-50 rounded-lg border border-blue-200">
-                    <h4 className="text-sm font-semibold text-blue-800 mb-2">Prompt Suggestions</h4>
                     <ul className="text-sm text-blue-700 space-y-1">
                       {evaluation.feedback.promptSuggestions.map((s, i) => (
                         <li key={i} className="flex items-start gap-2">

                   </div>
                   <div className="p-4 bg-yellow-50 rounded-lg border border-yellow-200">
+                    <h4 className="text-sm font-semibold text-yellow-800 mb-2">System Prompt Issues</h4>
                     {evaluation.feedback.improvementAreas.length > 0 ? (
                       <ul className="text-sm text-yellow-700 space-y-1">
                         {evaluation.feedback.improvementAreas.map((a, i) => (
                 {/* Prompt Suggestions */}
                 {evaluation.feedback.promptSuggestions.length > 0 && (
                   <div className="p-4 bg-blue-50 rounded-lg border border-blue-200">
+                    <h4 className="text-sm font-semibold text-blue-800 mb-2">System Prompt Suggestions</h4>
                     <ul className="text-sm text-blue-700 space-y-1">
                       {evaluation.feedback.promptSuggestions.map((s, i) => (
                         <li key={i} className="flex items-start gap-2">

src/app/admin/conversations/page.tsx CHANGED

@@ -76,6 +76,26 @@ export default function AdminConversationsPage() {
     setPage(1); // Reset to first page when filters change
   };
   return (
     <div className="p-4 md:p-8">
       {/* Header */}
@@ -274,13 +294,12 @@ export default function AdminConversationsPage() {
                           >
                             View Messages
                           </Link>
-                          <a
-                            href={`/api/admin/conversations/${conv.id}/export/tsv`}
                             className="text-green-600 hover:text-green-900"
-                            download
                           >
                             Download
-                          </a>
                         </div>
                       </td>
                     </tr>
@@ -338,13 +357,12 @@ export default function AdminConversationsPage() {
                     >
                       View Messages
                     </Link>
-                    <a
-                      href={`/api/admin/conversations/${conv.id}/export/tsv`}
                       className="flex-1 text-center px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
-                      download
                     >
                       Download
-                    </a>
                   </div>
                 </div>
               ))

     setPage(1); // Reset to first page when filters change
   };
+  const handleDownload = async (conversationId: string, title: string) => {
+    try {
+      const response = await adminFetch(`/api/admin/conversations/${conversationId}/export/tsv`);
+      if (!response.ok) {
+        throw new Error('Failed to download conversation');
+      }
+      const blob = await response.blob();
+      const url = URL.createObjectURL(blob);
+      const a = document.createElement('a');
+      a.href = url;
+      const sanitizedTitle = (title || conversationId).replace(/[<>:"/\\|?*\x00-\x1F]/g, '_').replace(/\s+/g, '_').substring(0, 100);
+      a.download = `conversation_${sanitizedTitle}_${new Date().toISOString().split('T')[0]}.tsv`;
+      a.click();
+      URL.revokeObjectURL(url);
+    } catch (err) {
+      console.error('Download error:', err);
+      alert('Failed to download conversation. Please try again.');
+    }
+  };
   return (
     <div className="p-4 md:p-8">
       {/* Header */}
                           >
                             View Messages
                           </Link>
+                          <button
+                            onClick={() => handleDownload(conv.id, conv.title)}
                             className="text-green-600 hover:text-green-900"
                           >
                             Download
+                          </button>
                         </div>
                       </td>
                     </tr>
                     >
                       View Messages
                     </Link>
+                    <button
+                      onClick={() => handleDownload(conv.id, conv.title)}
                       className="flex-1 text-center px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
                     >
                       Download
+                    </button>
                   </div>
                 </div>
               ))

src/app/admin/evaluations/[id]/page.tsx CHANGED

@@ -369,7 +369,7 @@ export default function EvaluationDetailPage({
         {/* Improvement Areas */}
         <div className="bg-white rounded-lg shadow p-6">
           <h3 className="text-lg font-semibold text-gray-900 mb-4 flex items-center gap-2">
-            <span className="text-yellow-500">!</span> {evaluation.evaluationMode === 'coach_direct' ? '教練改進空間' : 'Prompt Improvements'}
           </h3>
           {evaluation.feedback.improvementAreas.length > 0 ? (
             <ul className="space-y-2">
@@ -389,7 +389,7 @@ export default function EvaluationDetailPage({
       {/* Prompt Suggestions */}
       <div className="bg-blue-50 border border-blue-200 rounded-lg p-6 mb-6">
         <h3 className="text-lg font-semibold text-blue-900 mb-4 flex items-center gap-2">
-          <span>💡</span> {evaluation.evaluationMode === 'coach_direct' ? '教練提示改進建議' : 'Prompt Improvement Suggestions'}
         </h3>
         {evaluation.feedback.promptSuggestions.length > 0 ? (
           <ul className="space-y-2">

         {/* Improvement Areas */}
         <div className="bg-white rounded-lg shadow p-6">
           <h3 className="text-lg font-semibold text-gray-900 mb-4 flex items-center gap-2">
+            <span className="text-yellow-500">!</span> {evaluation.evaluationMode === 'coach_direct' ? '教練改進空間' : 'System Prompt Issues'}
           </h3>
           {evaluation.feedback.improvementAreas.length > 0 ? (
             <ul className="space-y-2">
       {/* Prompt Suggestions */}
       <div className="bg-blue-50 border border-blue-200 rounded-lg p-6 mb-6">
         <h3 className="text-lg font-semibold text-blue-900 mb-4 flex items-center gap-2">
+          <span>💡</span> {evaluation.evaluationMode === 'coach_direct' ? '教練提示改進建議' : 'System Prompt Suggestions (For Prompt Engineers)'}
         </h3>
         {evaluation.feedback.promptSuggestions.length > 0 ? (
           <ul className="space-y-2">

src/app/api/conversations/create/route.ts CHANGED

@@ -55,6 +55,20 @@ export async function POST(request: NextRequest) {
       systemPrompt = await promptService.getStudentPrompt(studentPromptId);
     }
     // Get coach info
     const coach = await promptService.getCoachPrompt(coachPromptId);
@@ -82,7 +96,7 @@ export async function POST(request: NextRequest) {
       userId,
       studentPromptId,
       coachPromptId as CoachType,
-      title,
       summary,
       systemPrompt
     );

       systemPrompt = await promptService.getStudentPrompt(studentPromptId);
     }
+    // Generate default title from student prompt if none provided
+    let defaultTitle = title;
+    if (!defaultTitle && studentPromptId !== 'coach_direct' && studentConfig) {
+      // Extract short context from description (last phrase after comma, or full if short)
+      const desc = studentConfig.description;
+      const shortContext = desc.includes('，')
+        ? desc.split('，').slice(-1)[0].substring(0, 20)
+        : desc.substring(0, 20);
+      defaultTitle = `${studentConfig.name} - ${shortContext}`;
+    }
+    if (!defaultTitle && studentPromptId === 'coach_direct') {
+      defaultTitle = '教練對話';
+    }
     // Get coach info
     const coach = await promptService.getCoachPrompt(coachPromptId);
       userId,
       studentPromptId,
       coachPromptId as CoachType,
+      defaultTitle,
       summary,
       systemPrompt
     );

src/lib/services/evaluation-service.ts CHANGED

@@ -91,11 +91,16 @@ overallScore = 0.5 × coachingQuality.overall + 0.5 × teacherLearning.overall
     "rationale": "<對話品質說明>"
   },
   "overallScore": <數字 1-5，按權重計算>,
-  "strengths": ["<教練的優點>"],
-  "improvementAreas": ["<教練需要改進的地方>"],
-  "promptSuggestions": ["<具體的教練提示修改建議>"]
 }
 **teacherEngagement.level 判斷標準：**
 - "high": 教師積極參與，提出有意義的問題和回應
 - "medium": 教師有參與但互動較淺
@@ -176,11 +181,16 @@ overallScore = 0.2 × promptDesign.overall + 0.2 × trainingEffectiveness.overal
     "rationale": "<教師體驗說明 - 描述教師在對話中的感受和學習>"
   },
   "overallScore": <數字 1-5，按權重計算>,
-  "strengths": ["<提示的優點>"],
-  "improvementAreas": ["<提示需要改進的地方>"],
-  "promptSuggestions": ["<具體的提示修改建議>"]
 }
 **teacherEngagement.level 判斷標準：**
 - "high": 教師積極參與，提出有意義的問題和回應
 - "medium": 教師有參與但互動較淺

     "rationale": "<對話品質說明>"
   },
   "overallScore": <數字 1-5，按權重計算>,
+  "strengths": ["<教練系統提示的優點 - 系統提示中有效的設計元素>"],
+  "improvementAreas": ["<教練系統提示的技術問題 - 提示工程師需要修正的提示詞問題，不是對教師的建議>"],
+  "promptSuggestions": ["<具體的系統提示修改建議 - 提供給提示工程師的具體文字/指令修改方案>"]
 }
+**重要：improvementAreas 和 promptSuggestions 是給提示工程師的技術建議，用於改進 AI 系統提示，而非給教師的建議。**
+錯誤範例（對教師的建議）：「老師可以多用開放式問句」「建議老師使用更同理的語氣」
+正確範例（對提示工程師的建議）：「系統提示應增加教練主動追問的指示」「建議在提示詞中加入更多情境判斷的引導語句」
 **teacherEngagement.level 判斷標準：**
 - "high": 教師積極參與，提出有意義的問題和回應
 - "medium": 教師有參與但互動較淺
     "rationale": "<教師體驗說明 - 描述教師在對話中的感受和學習>"
   },
   "overallScore": <數字 1-5，按權重計算>,
+  "strengths": ["<學生系統提示的優點 - 系統提示中有效的設計元素>"],
+  "improvementAreas": ["<學生系統提示的技術問題 - 提示工程師需要修正的提示詞問題，不是對教師的建議>"],
+  "promptSuggestions": ["<具體的系統提示修改建議 - 提供給提示工程師的具體文字/指令修改方案>"]
 }
+**重要：improvementAreas 和 promptSuggestions 是給提示工程師的技術建議，用於改進 AI 學生系統提示，而非給教師的建議。**
+錯誤範例（對教師的建議）：「老師可以多用開放式問句」「建議老師使用更同理的語氣」「老師的談話多停在鼓勵」
+正確範例（對提示工程師的建議）：「系統提示應增加學生對教師特定技巧的回應指示」「建議在提示詞中加入階段轉換的明確標記」「提示詞應指示AI學生在教師使用開放式問句時展現更多情緒開放」
 **teacherEngagement.level 判斷標準：**
 - "high": 教師積極參與，提出有意義的問題和回應
 - "medium": 教師有參與但互動較淺