Spaces: vivekvish2004/openenv-customer-support

vivekvish2004 committed 10 days ago

Commit

08e86b6

1 Parent(s): 44db10b

feat: proper task design — realistic scenarios, clearer graders, auto-validate

Task Design (evaluation criterion):
- tasks.json: 7 tasks with real-world scenarios embedded, example_input/actions,
scoring breakdowns, passing_threshold, actions_required per task
- EASY: binary single-action tasks (classify OR priority only)
- MEDIUM: 2-step tasks with partial credit (0.5 each)
- HARD: multi-step with strict per-component scoring
- env.py: 12 rich real-world scenarios covering all 6 categories:
refund/technical_issue/login_issue/general_inquiry/security/feedback
Each with context fields, varied sentiments, realistic urgency signals
- Better reward messages (emojis, specific feedback)
- Stricter resolve: requires classify+priority+response first
- Empathy detection improved for angry/panicked/concerned sentiment
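
The EASY and MEDIUM tiers described above can be sketched roughly as follows. This is an illustration only, not the grader in this commit: `score_task`, its parameters, and the action names are hypothetical, and HARD is left out because the message does not spell out its per-component weights.

```python
from typing import Dict, List


def score_task(difficulty: str, required: List[str], correct: Dict[str, bool]) -> float:
    """Score one task attempt in [0.0, 1.0] (hypothetical helper).

    required: action names the task asks for, e.g. ["classify", "priority"].
    correct:  maps each action name to whether the agent got it right.
    """
    if not required:
        return 0.0
    hits = sum(1 for name in required if correct.get(name, False))
    if difficulty == "EASY":
        # Binary: the single required action is either right or wrong.
        return 1.0 if hits == len(required) else 0.0
    if difficulty == "MEDIUM":
        # Partial credit: each step contributes an equal share
        # (0.5 each for a two-step task).
        return hits / len(required)
    # HARD weights are not given in the commit message, so they are not guessed here.
    raise ValueError(f"unsupported difficulty in this sketch: {difficulty}")


print(score_task("MEDIUM", ["classify", "priority"], {"classify": True, "priority": False}))
```

Under these assumptions, a MEDIUM task with one of two steps correct scores 0.5, while an EASY task is all or nothing.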

Fix duplicate Operation ID warning in /reset endpoint

scripts/validate-submission.sh:
- Auto-starts server if not running (no manual server needed)
- Waits up to 15s for server ready
- Tests task design fields + difficulty spread
- Cleans up server after test
- Exit codes: 0=pass, 1=fail
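
The "wait up to 15s" behaviour is a plain polling loop. The script itself does this in bash with `curl`; the Python sketch below only mirrors the idea. `wait_until_ready` and `health_probe` are hypothetical names, and unlike `curl -s` the probe here treats only HTTP 200 as ready.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_probe(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=1) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 15,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Sleep, then probe, up to `attempts` times; True as soon as a probe succeeds."""
    for _ in range(attempts):
        sleep(interval_s)
        if probe():
            return True
    return False
```

A caller would use something like `wait_until_ready(lambda: health_probe("http://localhost:7860"))` and stop the spawned server process if it returns `False`. Passing `sleep` as a parameter keeps the loop testable without real delays.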

Files changed (4)

scripts/validate-submission.sh +197 -66
server/app.py +3 -2
server/env.py +170 -85
tasks.json +104 -28

scripts/validate-submission.sh CHANGED

@@ -1,132 +1,263 @@
 #!/usr/bin/env bash
 # ============================================================
 # OpenEnv Submission Validator
-# Tests all 4 evaluation criteria locally before submission
-# Usage: bash scripts/validate-submission.sh
 # ============================================================
-set -e
-BASE="http://localhost:7860"
 PASS=0
 FAIL=0
-RED='\033[0;31m'
-GREEN='\033[0;32m'
-YELLOW='\033[1;33m'
-NC='\033[0m'
 ok()   { echo -e "  ${GREEN}✅ PASS${NC} — $1"; ((PASS++)); }
 fail() { echo -e "  ${RED}❌ FAIL${NC} — $1"; ((FAIL++)); }
 info() { echo -e "  ${YELLOW}ℹ️  ${NC} $1"; }
 echo ""
 echo "╔══════════════════════════════════════════════════════╗"
 echo "║    OpenEnv Customer Support — Submission Validator   ║"
 echo "╚══════════════════════════════════════════════════════╝"
-echo ""
-# ── 1. Runtime Correctness ──────────────────────────────────
-echo "▶ 1. RUNTIME CORRECTNESS — Runs without errors"
-STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE/health")
-if [ "$STATUS" = "200" ]; then
-  HEALTH=$(curl -s "$BASE/health")
-  if echo "$HEALTH" | grep -q '"healthy"'; then
-    ok "/health returns {status: healthy} (HTTP 200)"
   else
-    fail "/health response missing 'healthy': $HEALTH"
   fi
 else
-  fail "/health endpoint unreachable (HTTP $STATUS)"
 fi
 STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE/openapi.json")
-[ "$STATUS" = "200" ] && ok "/openapi.json reachable (HTTP 200)" || fail "/openapi.json not found"
-echo ""
-# ── 2. Interface Compliance ─────────────────────────────────
-echo "▶ 2. INTERFACE COMPLIANCE — Follows OpenEnv standard"
 # /metadata
 META=$(curl -s "$BASE/metadata")
 echo "$META" | grep -q '"name"' && echo "$META" | grep -q '"description"' \
-  && ok "/metadata has name + description" \
-  || fail "/metadata missing required fields: $META"
 # /schema
 SCHEMA=$(curl -s "$BASE/schema")
-echo "$SCHEMA" | grep -q '"action"' && echo "$SCHEMA" | grep -q '"observation"' && echo "$SCHEMA" | grep -q '"state"' \
-  && ok "/schema has action + observation + state" \
-  || fail "/schema missing required fields"
-# /reset
 RESET=$(curl -s -X POST "$BASE/reset")
-echo "$RESET" | grep -q '"observation"' \
-  && ok "/reset returns observation" \
-  || fail "/reset bad response: $RESET"
-# /step
 STEP=$(curl -s -X POST "$BASE/step" \
   -H "Content-Type: application/json" \
-  -d '{"action_type":"classify_ticket","payload":{"classification":"refund"}}')
-echo "$STEP" | grep -q '"observation"' && echo "$STEP" | grep -q '"reward"' && echo "$STEP" | grep -q '"done"' \
-  && ok "/step returns observation + reward + done" \
-  || fail "/step bad response: $STEP"
-# /state
 STATE=$(curl -s "$BASE/state")
-echo "$STATE" | grep -q '"observation"' \
-  && ok "/state returns observation" \
-  || fail "/state bad response: $STATE"
-echo ""
-# ── 3. Task Design ──────────────────────────────────────────
-echo "▶ 3. TASK DESIGN — Clear, realistic, testable"
 TASKS=$(curl -s "$BASE/tasks")
-TASK_COUNT=$(echo "$TASKS" | python3 -c "import sys,json; t=json.load(sys.stdin); print(len(t))" 2>/dev/null)
 GRADED_COUNT=$(echo "$TASKS" | python3 -c "import sys,json; t=json.load(sys.stdin); print(sum(1 for x in t if x.get('grader')))" 2>/dev/null)
-info "Found $TASK_COUNT tasks, $GRADED_COUNT with graders"
-[ "$TASK_COUNT" -ge 3 ] && ok "At least 3 tasks defined" || fail "Need ≥3 tasks, found $TASK_COUNT"
-[ "$GRADED_COUNT" -ge 3 ] && ok "At least 3 tasks have graders" || fail "Need ≥3 tasks with graders, found $GRADED_COUNT"
-echo ""
-# ── 4. Grading Logic ────────────────────────────────────────
-echo "▶ 4. GRADING LOGIC — Reward system makes sense"
-TASK_IDS=$(echo "$TASKS" | python3 -c "import sys,json; t=json.load(sys.stdin); [print(x['id']) for x in t if x.get('grader')]" 2>/dev/null)
 GRADER_OK=0
 GRADER_FAIL=0
 for TID in $TASK_IDS; do
   RESULT=$(curl -s "$BASE/grader?task_id=$TID")
-  SCORE=$(echo "$RESULT" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('score',''))" 2>/dev/null)
-  if [ -n "$SCORE" ]; then
-    IN_RANGE=$(python3 -c "s=float('$SCORE'); print('ok' if 0.0<=s<=1.0 else 'fail')" 2>/dev/null)
-    if [ "$IN_RANGE" = "ok" ]; then
-      info "$TID → score=$SCORE ✅"
-      ((GRADER_OK++))
-    else
-      info "$TID → score=$SCORE ❌ (out of range)"
-      ((GRADER_FAIL++))
-    fi
   else
-    info "$TID → grader error: $RESULT"
     ((GRADER_FAIL++))
   fi
 done
-[ "$GRADER_OK" -ge 3 ] && ok "$GRADER_OK graders return valid scores in [0.0, 1.0]" \
-  || fail "Only $GRADER_OK graders valid, need ≥3"
 echo ""
-echo "══════════════════════════════════════════════════════"
 echo -e "  Results: ${GREEN}$PASS passed${NC}  |  ${RED}$FAIL failed${NC}"
-echo "══════════════════════════════════════════════════════"
 echo ""
 if [ "$FAIL" -eq 0 ]; then
   echo -e "${GREEN}🎉 ALL CHECKS PASSED — Ready for OpenEnv submission!${NC}"
   exit 0

 #!/usr/bin/env bash
 # ============================================================
 # OpenEnv Submission Validator
+# Tests all 4 evaluation criteria — auto-starts server
+# Usage: bash scripts/validate-submission.sh [port]
 # ============================================================
+PORT="${1:-7860}"
+BASE="http://localhost:$PORT"
 PASS=0
 FAIL=0
+SERVER_STARTED=false
+RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
 ok()   { echo -e "  ${GREEN}✅ PASS${NC} — $1"; ((PASS++)); }
 fail() { echo -e "  ${RED}❌ FAIL${NC} — $1"; ((FAIL++)); }
 info() { echo -e "  ${YELLOW}ℹ️  ${NC} $1"; }
+hdr()  { echo -e "\n${BLUE}▶ $1${NC}"; }
 echo ""
 echo "╔══════════════════════════════════════════════════════╗"
 echo "║    OpenEnv Customer Support — Submission Validator   ║"
 echo "╚══════════════════════════════════════════════════════╝"
+# ── Auto-start server if not running ─────────────────────────
+echo ""
+echo "Checking server at $BASE ..."
+if ! curl -s --max-time 2 "$BASE/health" > /dev/null 2>&1; then
+  echo -e "${YELLOW}⚡ Server not running — starting on port $PORT ...${NC}"
+  # Find python/uvicorn
+  if [ -f ".venv/bin/python" ]; then
+    PYTHON=".venv/bin/python"
+  elif command -v python3 &>/dev/null; then
+    PYTHON="python3"
   else
+    echo -e "${RED}❌ Python not found. Activate venv first.${NC}"
+    exit 1
   fi
+  $PYTHON -m uvicorn server.app:app --host 0.0.0.0 --port "$PORT" --log-level warning &
+  SERVER_PID=$!
+  SERVER_STARTED=true
+  # Wait for server to be ready (up to 15s)
+  for i in {1..15}; do
+    sleep 1
+    if curl -s --max-time 1 "$BASE/health" > /dev/null 2>&1; then
+      echo -e "${GREEN}✅ Server ready on port $PORT${NC}"
+      break
+    fi
+    if [ "$i" -eq 15 ]; then
+      echo -e "${RED}❌ Server failed to start after 15s${NC}"
+      kill $SERVER_PID 2>/dev/null
+      exit 1
+    fi
+  done
+else
+  echo -e "${GREEN}✅ Server already running${NC}"
+fi
+echo ""
+# ══════════════════════════════════════════════════════════════
+# 1. RUNTIME CORRECTNESS — Runs without errors
+# ══════════════════════════════════════════════════════════════
+hdr "1. RUNTIME CORRECTNESS — Runs without errors"
+# Health
+HEALTH=$(curl -s "$BASE/health")
+if echo "$HEALTH" | grep -q '"healthy"'; then
+  ok "/health → {status: healthy}"
 else
+  fail "/health bad response: $HEALTH"
 fi
+# OpenAPI docs
 STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE/openapi.json")
+[ "$STATUS" = "200" ] && ok "/openapi.json available (HTTP 200)" || fail "/openapi.json HTTP $STATUS"
+# Reset doesn't crash
+RESET=$(curl -s -X POST "$BASE/reset" 2>&1)
+if echo "$RESET" | python3 -c "import sys,json; json.load(sys.stdin)" 2>/dev/null; then
+  ok "/reset returns valid JSON"
+else
+  fail "/reset error: $RESET"
+fi
+# Step doesn't crash
+STEP=$(curl -s -X POST "$BASE/step" \
+  -H "Content-Type: application/json" \
+  -d '{"action_type":"classify_ticket","payload":{"classification":"refund"}}' 2>&1)
+if echo "$STEP" | python3 -c "import sys,json; json.load(sys.stdin)" 2>/dev/null; then
+  ok "/step returns valid JSON"
+else
+  fail "/step error: $STEP"
+fi
+# ══════════════════════════════════════════════════════════════
+# 2. INTERFACE COMPLIANCE — Follows OpenEnv standard
+# ══════════════════════════════════════════════════════════════
+hdr "2. INTERFACE COMPLIANCE — Follows OpenEnv standard"
 # /metadata
 META=$(curl -s "$BASE/metadata")
 echo "$META" | grep -q '"name"' && echo "$META" | grep -q '"description"' \
+  && ok "/metadata has required fields (name, description)" \
+  || fail "/metadata missing fields: $META"
 # /schema
 SCHEMA=$(curl -s "$BASE/schema")
+if echo "$SCHEMA" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+assert 'action' in d and 'observation' in d and 'state' in d
+" 2>/dev/null; then
+  ok "/schema has action + observation + state"
+else
+  fail "/schema missing required fields: $SCHEMA"
+fi
+# /reset response shape
 RESET=$(curl -s -X POST "$BASE/reset")
+if echo "$RESET" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+assert 'observation' in d and 'reward' in d and 'done' in d
+" 2>/dev/null; then
+  ok "/reset response has {observation, reward, done}"
+else
+  fail "/reset bad shape: $RESET"
+fi
+# /step response shape
 STEP=$(curl -s -X POST "$BASE/step" \
   -H "Content-Type: application/json" \
+  -d '{"action_type":"assign_priority","payload":{"priority":"high"}}')
+if echo "$STEP" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+assert 'observation' in d
+assert 'reward' in d
+assert 'done' in d
+r = d['reward']
+assert isinstance(r, (int, float)), f'reward must be float, got {type(r)}'
+" 2>/dev/null; then
+  ok "/step returns {observation, reward(float), done, info}"
+else
+  fail "/step bad shape or reward not float: $STEP"
+fi
+# /state response shape
 STATE=$(curl -s "$BASE/state")
+echo "$STATE" | grep -q '"observation"' && ok "/state returns {observation}" || fail "/state bad shape: $STATE"
+# /mcp JSON-RPC
+MCP=$(curl -s -X POST "$BASE/mcp" -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"initialize","id":1}')
+echo "$MCP" | grep -q '"jsonrpc"' && ok "/mcp returns JSON-RPC 2.0 response" || fail "/mcp bad response: $MCP"
+# ══════════════════════════════════════════════════════════════
+# 3. TASK DESIGN — Clear, realistic, testable
+# ══════════════════════════════════════════════════════════════
+hdr "3. TASK DESIGN — Clear, realistic, testable"
 TASKS=$(curl -s "$BASE/tasks")
+TASK_COUNT=$(echo "$TASKS" | python3 -c "import sys,json; print(len(json.load(sys.stdin)))" 2>/dev/null)
 GRADED_COUNT=$(echo "$TASKS" | python3 -c "import sys,json; t=json.load(sys.stdin); print(sum(1 for x in t if x.get('grader')))" 2>/dev/null)
+info "Total tasks: $TASK_COUNT | Tasks with graders: $GRADED_COUNT"
+[ "${TASK_COUNT:-0}" -ge 3 ] \
+  && ok "≥3 tasks defined (found $TASK_COUNT)" \
+  || fail "Need ≥3 tasks, found ${TASK_COUNT:-0}"
+[ "${GRADED_COUNT:-0}" -ge 3 ] \
+  && ok "≥3 tasks have grader=true (found $GRADED_COUNT)" \
+  || fail "Need ≥3 graded tasks, found ${GRADED_COUNT:-0}"
+# Check tasks have required design fields
+DESIGN_OK=$(echo "$TASKS" | python3 -c "
+import sys, json
+tasks = json.load(sys.stdin)
+required = ['id','name','difficulty','objective','description']
+missing = []
+for t in tasks:
+    for f in required:
+        if f not in t:
+            missing.append(f'{t.get(\"id\",\"?\")}:{f}')
+print(len(missing))
+" 2>/dev/null)
+[ "${DESIGN_OK:-1}" -eq 0 ] \
+  && ok "All tasks have required design fields (id, name, difficulty, objective, description)" \
+  || fail "Some tasks missing design fields: $DESIGN_OK"
+# Check difficulty spread
+DIFF_SPREAD=$(echo "$TASKS" | python3 -c "
+import sys,json
+t = json.load(sys.stdin)
+diffs = set(x.get('difficulty','') for x in t)
+print('ok' if len(diffs) >= 2 else 'fail')
+" 2>/dev/null)
+[ "$DIFF_SPREAD" = "ok" ] \
+  && ok "Tasks span multiple difficulty levels (EASY / MEDIUM / HARD)" \
+  || fail "All tasks same difficulty — needs spread"
+# ══════════════════════════════════════════════════════════════
+# 4. GRADING LOGIC — Reward system makes sense
+# ══════════════════════════════════════════════════════════════
+hdr "4. GRADING LOGIC — Reward system in [0.0, 1.0]"
+TASK_IDS=$(echo "$TASKS" | python3 -c "
+import sys,json
+t=json.load(sys.stdin)
+for x in t:
+    if x.get('grader'): print(x['id'])
+" 2>/dev/null)
 GRADER_OK=0
 GRADER_FAIL=0
 for TID in $TASK_IDS; do
   RESULT=$(curl -s "$BASE/grader?task_id=$TID")
+  CHECK=$(echo "$RESULT" | python3 -c "
+import sys,json
+d=json.load(sys.stdin)
+s=float(d.get('score','-1'))
+if 0.0 <= s <= 1.0:
+    print(f'ok:{s}')
+else:
+    print(f'fail:{s}')
+" 2>/dev/null)
+  if echo "$CHECK" | grep -q "^ok:"; then
+    SCORE=$(echo "$CHECK" | cut -d: -f2)
+    info "$TID → score=$SCORE ✅"
+    ((GRADER_OK++))
   else
+    info "$TID → grader error or out-of-range: $RESULT"
     ((GRADER_FAIL++))
   fi
 done
+[ "$GRADER_OK" -ge 3 ] \
+  && ok "$GRADER_OK graders return valid scores in [0.0, 1.0]" \
+  || fail "Only $GRADER_OK graders valid — need ≥3"
+# ══════════════════════════════════════════════════════════════
+# Summary
+# ══════════════════════════════════════════════════════════════
 echo ""
+echo "══════════════════════════════════════════════════════════"
 echo -e "  Results: ${GREEN}$PASS passed${NC}  |  ${RED}$FAIL failed${NC}"
+echo "══════════════════════════════════════════════════════════"
 echo ""
+# Cleanup server if we started it
+if [ "$SERVER_STARTED" = true ] && [ -n "$SERVER_PID" ]; then
+  kill $SERVER_PID 2>/dev/null
+  echo "Server stopped."
+fi
 if [ "$FAIL" -eq 0 ]; then
   echo -e "${GREEN}🎉 ALL CHECKS PASSED — Ready for OpenEnv submission!${NC}"
   exit 0
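
The grader check in section 4 of the script comes down to "parse the JSON, read `score`, confirm it lies in [0.0, 1.0]". A standalone sketch of that check is below. `valid_score` is a hypothetical helper, not part of the repository, and it returns `None` for a missing score where the inline script substitutes `-1`.

```python
import json
from typing import Optional


def valid_score(raw: str) -> Optional[float]:
    """Return the score if `raw` is JSON with a numeric 'score' in [0.0, 1.0], else None."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing key, a non-object payload, or a non-numeric score.
        return None
    # NaN fails both comparisons, so it is rejected as well.
    return score if 0.0 <= score <= 1.0 else None


for body in ('{"score": 0.5}', '{"score": 1.2}', '{"detail": "not found"}', "not json"):
    print(repr(body), "->", valid_score(body))
```

Returning `None` for every failure mode lets the caller count a grader as valid with a single `is not None` test.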

server/app.py CHANGED

@@ -107,9 +107,10 @@ def get_schema():
     }
-@app.api_route("/reset", methods=["GET", "POST"], tags=["Environment Control"])
 def reset_env():
-    """Reset the environment and yield the initial observation."""
     obs = env_instance.reset()
     return {
         "observation": obs.state,

     }
+@app.get("/reset", tags=["Environment Control"], operation_id="reset_env_get")
+@app.post("/reset", tags=["Environment Control"], operation_id="reset_env_post")
 def reset_env():
+    """Reset the environment and yield the initial observation (GET or POST)."""
     obs = env_instance.reset()
     return {
         "observation": obs.state,

server/env.py CHANGED

@@ -1,112 +1,170 @@
 import random
-import time
 import copy
 from typing import Tuple, List, Dict, Any
 from server.models import Action, Observation, Reward
 from server.tasks import TASKS
-# Expanded Scenarios with SLA metadata
 SCENARIOS = [
     {
-        "ticket_text": "I bought a premium subscription but it's not working. I want my money back right now!",
         "sentiment": "angry",
         "expected_classification": "refund",
         "expected_priority": "high",
         "sla_steps": 5,
     },
     {
-        "ticket_text": "How do I change my profile picture? I tried looking in the settings but couldn't find it.",
         "sentiment": "neutral",
-        "expected_classification": "general_inquiry",
-        "expected_priority": "low",
         "sla_steps": 8,
     },
     {
-        "ticket_text": "I can't log into my account, and I have a huge presentation in 10 minutes that needs the data!",
         "sentiment": "panicked",
-        "expected_classification": "login_issue",
         "expected_priority": "high",
         "sla_steps": 3,
     },
     {
-        "ticket_text": "The latest update keeps crashing on my Android phone. Please fix it ASAP.",
-        "sentiment": "angry",
         "expected_classification": "technical_issue",
-        "expected_priority": "medium",
-        "sla_steps": 6,
     },
     {
-        "ticket_text": "Do you offer any discounts for students or non-profits?",
         "sentiment": "curious",
         "expected_classification": "general_inquiry",
         "expected_priority": "low",
         "sla_steps": 10,
     },
     {
-        "ticket_text": "I just wanted to say that your support team is amazing! Thank you for the quick help.",
-        "sentiment": "happy",
-        "expected_classification": "feedback",
         "expected_priority": "low",
-        "sla_steps": 12,
     },
     {
-        "ticket_text": "I'm worried about my data privacy after the recent news. Can you explain your encryption?",
         "sentiment": "concerned",
         "expected_classification": "security",
         "expected_priority": "medium",
         "sla_steps": 7,
-    }
 ]
 class CustomerSupportEnv:
     def __init__(self):
         """Initialize the Enterprise AI Customer Support environment."""
         self.queue: List[Dict] = []
         self.resolved_count = 0
         self.total_reward = 0.0
-        self.max_steps_per_ticket = 10
         self.current_step = 0
-        self.actions_taken = set()
-        self.history = []
     def reset(self) -> Observation:
-        """Initialize a new enterprise session with a queue of tickets."""
-        # Pick 3 random unique scenarios for the queue
         self.queue = [copy.deepcopy(s) for s in random.sample(SCENARIOS, 3)]
         self.resolved_count = 0
         self.total_reward = 0.0
         self.current_step = 0
         self.actions_taken = set()
         self.history = []
         return self.state()
     def state(self) -> Observation:
         """Standard OpenEnv API: Retrieve the current observation state."""
-        # Shared info for both state and info fields to satisfy frontend expectations
         current_info = {
-            "queue": [t["ticket_text"][:30] + "..." for t in self.queue],
             "resolved": self.resolved_count,
             "total_reward": self.total_reward,
-            "queue_size": len(self.queue)
         }
         if not self.queue:
             return Observation(
                 state={
-                    "status": "session_complete",
                     "message": "All tickets in queue processed.",
                     "total_reward": self.total_reward,
                     "resolved": self.resolved_count,
-                    "info": current_info
                 },
-                info=current_info
             )
         ticket = self.queue[0]
         obs_state = {
             "ticket_text": ticket["ticket_text"],
             "sentiment": ticket["sentiment"],
             "priority": ticket.get("priority"),
             "status": ticket.get("status", "open"),
             "steps_taken": self.current_step,
@@ -118,29 +176,25 @@ class CustomerSupportEnv:
             "total_reward": self.total_reward,
             "resolved": self.resolved_count,
             "last_step_status": self.history[-1]["status"] if self.history else "neutral",
-            "info": current_info # Redundant but fixes frontend lookups
         }
         return Observation(state=obs_state, info=current_info)
     @property
-    def current_state(self):
-        """Helper for the grader to access the current ticket state dictionary."""
-        obs = self.state()
-        return obs.state
     @property
-    def ground_truth(self):
-        """Helper for the grader to access the expected values of the current ticket."""
-        if not self.queue:
-            return None
-        return self.queue[0]
     # Static tasks attribute for discovery
     tasks = TASKS
     def get_tasks(self) -> List[Dict]:
-        """Expose available tasks for OpenEnv discovery as a method."""
         return TASKS
     def grade(self, task_id: str, history: List[Dict[str, Any]], ground_truth: Dict[str, Any]) -> float:
@@ -148,10 +202,9 @@ class CustomerSupportEnv:
         return self.grade_task(task_id, history, ground_truth)
     def grade_task(self, task_id: str, history: List[Dict[str, Any]], ground_truth: Dict[str, Any]) -> float:
-        """Convenience method for the validator to grade a specific task execution."""
         from server.grader import score_episode
-        # Determine difficulty from task definition
         diff = "EASY"
         for t in TASKS:
             if t["id"] == task_id:
@@ -169,88 +222,120 @@ class CustomerSupportEnv:
         reward_val = 0.0
         is_terminal = False
         message = ""
         current_ticket = self.queue[0]
         a_type = action.action_type
         payload = action.payload
-        # Action Logic
         if a_type == "classify_ticket":
-            cat = payload.get("classification")
             current_ticket["classification"] = cat
-            reward_val += 0.3 if cat == current_ticket["expected_classification"] else -0.2
-            message = "Classified. "
         elif a_type == "assign_priority":
-            pri = payload.get("priority")
             current_ticket["priority"] = pri
-            reward_val += 0.2 if pri == current_ticket["expected_priority"] else -0.2
-            message = "Priority set. "
         elif a_type == "generate_response":
             resp = payload.get("response", "")
             current_ticket["response"] = resp
-            has_empathy = any(w in resp.lower() for w in ["sorry", "apologize", "understand", "help"])
-            if current_ticket["sentiment"] == "angry" and not has_empathy:
-                reward_val -= 0.1
-            reward_val += 0.2 if resp.strip() else -0.2
-            message = "Response drafted. "
         elif a_type == "resolve":
-            # Check completion
-            if current_ticket.get("classification") and current_ticket.get("priority") and current_ticket.get("response"):
                 reward_val += 0.4
-                message = "Ticket Resolved!"
                 current_ticket["status"] = "closed"
                 self.resolved_count += 1
-                # SLA Check
                 if self.current_step > current_ticket["sla_steps"]:
-                    reward_val -= 0.3
-                    message += " (SLA Breached)"
             else:
                 reward_val -= 0.2
-                message = "Resolution attempted without full data."
-            # Move to next ticket
             self.queue.pop(0)
             self.current_step = 0
             self.actions_taken = set()
             if not self.queue:
                 is_terminal = True
         elif a_type == "escalate":
-            reward_val += 0.2 if current_ticket["sentiment"] == "angry" else -0.2
-            message = "Escalated to Manager."
-            # Escalation closes current and moves on
             self.queue.pop(0)
             self.current_step = 0
             if not self.queue:
                 is_terminal = True
-        # Penalize repeated or excessive steps
         if a_type in self.actions_taken:
             reward_val -= 0.1
         self.actions_taken.add(a_type)
-        reward_val -= 0.05 # Smaller per-step cost
-        self.total_reward += reward_val
-        # Step-level Status Logic
         status = "success" if reward_val > 0 else "failed" if reward_val < 0 else "neutral"
-        # Update History with detailed metadata
         self.history.append({
             "step_count": len(self.history) + 1,
             "action": a_type,
             "reward": reward_val,
             "status": status,
-            "message": message
         })
         step_info = {
             "message": message,
             "status": status,
-            "reward": reward_val
         }
         return self.state(), Reward(value=reward_val, is_terminal=is_terminal), is_terminal, step_info

 import random
 import copy
 from typing import Tuple, List, Dict, Any
 from server.models import Action, Observation, Reward
 from server.tasks import TASKS
+# ── Real-world customer support scenarios ─────────────────────────────────────
+# Each scenario covers a distinct category with strong classification signals
+# and realistic urgency cues so agents can learn correct behavior.
 SCENARIOS = [
+    # REFUND — angry, clear billing error
     {
+        "ticket_text": "I was charged twice for my annual subscription this month. I have the bank statement to prove it. I want one payment refunded immediately.",
         "sentiment": "angry",
         "expected_classification": "refund",
         "expected_priority": "high",
         "sla_steps": 5,
+        "context": "Duplicate billing charge. Customer has proof. High urgency.",
     },
+    # REFUND — neutral, post-cancellation billing
     {
+        "ticket_text": "I cancelled my subscription 3 days ago but was still billed for next month. I need this refunded please.",
         "sentiment": "neutral",
+        "expected_classification": "refund",
+        "expected_priority": "medium",
         "sla_steps": 8,
+        "context": "Post-cancellation charge. Polite customer, standard urgency.",
+    },
+    # TECHNICAL ISSUE — angry, regression crash
+    {
+        "ticket_text": "The app crashes every single time I open a file larger than 50MB. This has been broken since last week's update — I cannot do my work.",
+        "sentiment": "angry",
+        "expected_classification": "technical_issue",
+        "expected_priority": "high",
+        "sla_steps": 6,
+        "context": "Regression bug blocking core workflow.",
     },
+    # TECHNICAL ISSUE — panicked, team outage
     {
+        "ticket_text": "Our entire development team cannot access the API since this morning. We have a production deployment in 2 hours — this is a critical emergency!",
         "sentiment": "panicked",
+        "expected_classification": "technical_issue",
         "expected_priority": "high",
         "sla_steps": 3,
+        "context": "P0 outage. Production deadline imminent.",
     },
+    # TECHNICAL ISSUE — neutral, minor UI bug
     {
+        "ticket_text": "The dark mode setting doesn't save when I refresh the page. It reverts to light mode every time. Minor issue but a bit annoying.",
+        "sentiment": "neutral",
         "expected_classification": "technical_issue",
+        "expected_priority": "low",
+        "sla_steps": 10,
+        "context": "Minor UI preference bug. No business impact.",
+    },
+    # LOGIN ISSUE — panicked, team locked out
+    {
+        "ticket_text": "I reset my password twice but I still cannot log in. My whole team is locked out and we have a client demo starting in 15 minutes!",
+        "sentiment": "panicked",
+        "expected_classification": "login_issue",
+        "expected_priority": "high",
+        "sla_steps": 4,
+        "context": "Password reset loop, team locked out. Time critical.",
+    },
+    # LOGIN ISSUE — neutral, standard password reset
+    {
+        "ticket_text": "Hi, I forgot my password. Can you help me reset it or send me a recovery link? No rush, just let me know when you can.",
+        "sentiment": "neutral",
+        "expected_classification": "login_issue",
+        "expected_priority": "low",
+        "sla_steps": 12,
+        "context": "Standard password recovery. No urgency.",
     },
+    # GENERAL INQUIRY — curious, pricing
     {
+        "ticket_text": "Do you offer a non-profit discount? We are a registered charity and your standard price is a little high for our annual budget.",
         "sentiment": "curious",
         "expected_classification": "general_inquiry",
         "expected_priority": "low",
         "sla_steps": 10,
+        "context": "Pricing question. Low urgency.",
     },
+    # GENERAL INQUIRY — neutral, how-to
     {
+        "ticket_text": "How do I export all my project data to CSV? I need to share it with a client in a different format.",
+        "sentiment": "neutral",
+        "expected_classification": "general_inquiry",
         "expected_priority": "low",
+        "sla_steps": 10,
+        "context": "Basic how-to question. No urgency.",
+    },
+    # SECURITY — concerned, unauthorized login
+    {
+        "ticket_text": "I received an alert that someone logged into my account from a location I don't recognize. I did not authorize this. Is my account compromised?",
+        "sentiment": "concerned",
+        "expected_classification": "security",
+        "expected_priority": "high",
+        "sla_steps": 4,
+        "context": "Potential account takeover. Must be high priority.",
     },
+    # SECURITY — concerned, encryption question
     {
+        "ticket_text": "After reading about recent data breaches at other SaaS companies, I want to understand what encryption you use to protect my credit card details.",
         "sentiment": "concerned",
         "expected_classification": "security",
         "expected_priority": "medium",
         "sla_steps": 7,
+        "context": "Security assurance question. No active breach.",
+    },
+    # FEEDBACK — happy, positive
+    {
+        "ticket_text": "The new dashboard redesign is fantastic! Generating a report used to take me 10 minutes — now it's instant. Your team did an amazing job!",
+        "sentiment": "happy",
+        "expected_classification": "feedback",
+        "expected_priority": "low",
+        "sla_steps": 15,
+        "context": "Positive feedback. No action needed urgently.",
+    },
 ]
 class CustomerSupportEnv:
     def __init__(self):
         """Initialize the Enterprise AI Customer Support environment."""
         self.queue: List[Dict] = []
         self.resolved_count = 0
         self.total_reward = 0.0
         self.current_step = 0
+        self.actions_taken: set = set()
+        self.history: List[Dict] = []
     def reset(self) -> Observation:
+        """Standard OpenEnv API: Initialize a new session with a queue of 3 tickets."""
         self.queue = [copy.deepcopy(s) for s in random.sample(SCENARIOS, 3)]
         self.resolved_count = 0
         self.total_reward = 0.0
         self.current_step = 0
         self.actions_taken = set()
         self.history = []
         return self.state()
     def state(self) -> Observation:
         """Standard OpenEnv API: Retrieve the current observation state."""
         current_info = {
+            "queue": [t["ticket_text"][:40] + "..." for t in self.queue],
             "resolved": self.resolved_count,
             "total_reward": self.total_reward,
+            "queue_size": len(self.queue),
         }
         if not self.queue:
             return Observation(
                 state={
+                    "status": "session_complete",
                     "message": "All tickets in queue processed.",
                     "total_reward": self.total_reward,
                     "resolved": self.resolved_count,
+                    "info": current_info,
                 },
+                info=current_info,
             )
         ticket = self.queue[0]
         obs_state = {
             "ticket_text": ticket["ticket_text"],
             "sentiment": ticket["sentiment"],
+            "context": ticket.get("context", ""),
             "priority": ticket.get("priority"),
             "status": ticket.get("status", "open"),
             "steps_taken": self.current_step,
             "total_reward": self.total_reward,
             "resolved": self.resolved_count,
             "last_step_status": self.history[-1]["status"] if self.history else "neutral",
+            "info": current_info,
         }
         return Observation(state=obs_state, info=current_info)
     @property
+    def current_state(self) -> Dict:
+        """Helper: current ticket state dict for grading."""
+        return self.state().state
     @property
+    def ground_truth(self) -> Dict | None:
+        """Helper: expected values for the current ticket."""
+        return self.queue[0] if self.queue else None
     # Static tasks attribute for discovery
     tasks = TASKS
     def get_tasks(self) -> List[Dict]:
+        """Expose available tasks for OpenEnv discovery."""
         return TASKS
     def grade(self, task_id: str, history: List[Dict[str, Any]], ground_truth: Dict[str, Any]) -> float:
         return self.grade_task(task_id, history, ground_truth)
     def grade_task(self, task_id: str, history: List[Dict[str, Any]], ground_truth: Dict[str, Any]) -> float:
+        """Grade a specific task execution. Returns float in [0.0, 1.0]."""
         from server.grader import score_episode
         diff = "EASY"
         for t in TASKS:
             if t["id"] == task_id:
         reward_val = 0.0
         is_terminal = False
         message = ""
         current_ticket = self.queue[0]
         a_type = action.action_type
         payload = action.payload
+        # ── Action Logic ──────────────────────────────────────────────────────
         if a_type == "classify_ticket":
+            cat = payload.get("classification", "")
             current_ticket["classification"] = cat
+            if cat == current_ticket["expected_classification"]:
+                reward_val += 0.35
+                message = f"✅ Classified correctly as '{cat}'."
+            else:
+                reward_val -= 0.2
+                message = f"❌ Wrong classification '{cat}' (expected: {current_ticket['expected_classification']})."
         elif a_type == "assign_priority":
+            pri = payload.get("priority", "")
             current_ticket["priority"] = pri
+            if pri == current_ticket["expected_priority"]:
+                reward_val += 0.25
+                message = f"✅ Priority set to '{pri}' correctly."
+            elif pri in ("high", "medium", "low"):
+                reward_val -= 0.15
+                message = f"⚠️ Priority '{pri}' (expected: {current_ticket['expected_priority']})."
+            else:
+                reward_val -= 0.2
+                message = f"❌ Invalid priority value '{pri}'."
         elif a_type == "generate_response":
             resp = payload.get("response", "")
             current_ticket["response"] = resp
+            if not resp.strip():
+                reward_val -= 0.2
+                message = "❌ Empty response: penalty applied."
+            else:
+                reward_val += 0.2
+                # Empathy check for negative sentiment
+                if current_ticket["sentiment"] in ("angry", "panicked", "concerned"):
+                    empathy_words = ["sorry", "apologize", "understand", "concern", "frustrat"]
+                    if not any(w in resp.lower() for w in empathy_words):
+                        reward_val -= 0.1
+                        message = "⚠️ Response drafted but missing empathy for upset customer."
+                    else:
+                        message = "✅ Empathetic response drafted."
+                else:
+                    message = "✅ Response drafted."
         elif a_type == "resolve":
+            has_classify = bool(current_ticket.get("classification"))
+            has_priority = bool(current_ticket.get("priority"))
+            has_response = bool(current_ticket.get("response"))
+            if has_classify and has_priority and has_response:
                 reward_val += 0.4
                 current_ticket["status"] = "closed"
                 self.resolved_count += 1
+                message = "✅ Ticket fully resolved!"
+                # SLA penalty
                 if self.current_step > current_ticket["sla_steps"]:
+                    reward_val -= 0.25
+                    message += " ⚠️ SLA breached."
             else:
+                missing = [k for k, v in [("classification", has_classify), ("priority", has_priority), ("response", has_response)] if not v]
                 reward_val -= 0.2
+                message = f"❌ Cannot resolve — missing: {', '.join(missing)}."
             self.queue.pop(0)
             self.current_step = 0
             self.actions_taken = set()
             if not self.queue:
                 is_terminal = True
         elif a_type == "escalate":
+            if current_ticket["sentiment"] in ("angry", "panicked"):
+                reward_val += 0.15
+                message = "✅ Escalated — appropriate for high-urgency customer."
+            else:
+                reward_val -= 0.15
+                message = "⚠️ Escalated a non-urgent ticket — overkill."
             self.queue.pop(0)
             self.current_step = 0
+            self.actions_taken = set()
             if not self.queue:
                 is_terminal = True
+        else:
+            reward_val -= 0.1
+            message = f"❌ Unknown action type '{a_type}'."
+        # Penalize repeated actions on the same ticket
         if a_type in self.actions_taken:
             reward_val -= 0.1
+            message += " (Repeated action penalty)"
         self.actions_taken.add(a_type)
+        # Small per-step cost to encourage efficiency
+        reward_val -= 0.02
+        self.total_reward += reward_val
         status = "success" if reward_val > 0 else "failed" if reward_val < 0 else "neutral"
         self.history.append({
             "step_count": len(self.history) + 1,
             "action": a_type,
             "reward": reward_val,
             "status": status,
+            "message": message,
         })
         step_info = {
             "message": message,
             "status": status,
+            "reward": reward_val,
         }
         return self.state(), Reward(value=reward_val, is_terminal=is_terminal), is_terminal, step_info
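
The reward arithmetic in the `step` logic above can be checked in isolation. The following is a minimal sketch, not the repository's implementation: `score_action` is a hypothetical standalone helper that mirrors the reward values visible in this diff (0.35 / 0.25 / 0.2 / 0.4, the 0.1 repeat penalty, and the 0.02 per-step cost). It assumes the caller supplies `current_step` and a per-ticket `actions_taken` set, since how the environment advances those is not shown here.

```python
from typing import Dict, Set

STEP_COST = 0.02          # per-step cost, as in the diff
REPEAT_PENALTY = 0.1      # repeated action on the same ticket
EMPATHY_WORDS = ("sorry", "apologize", "understand", "concern", "frustrat")


def score_action(ticket: Dict, action_type: str, payload: Dict,
                 actions_taken: Set[str], current_step: int = 0) -> float:
    """Hypothetical helper: reward for one action on one ticket."""
    reward = 0.0
    if action_type == "classify_ticket":
        ticket["classification"] = payload.get("classification", "")
        correct = ticket["classification"] == ticket["expected_classification"]
        reward += 0.35 if correct else -0.2
    elif action_type == "assign_priority":
        pri = payload.get("priority", "")
        ticket["priority"] = pri
        if pri == ticket["expected_priority"]:
            reward += 0.25
        elif pri in ("high", "medium", "low"):
            reward -= 0.15   # valid value, wrong level
        else:
            reward -= 0.2    # not a valid priority at all
    elif action_type == "generate_response":
        resp = payload.get("response", "")
        ticket["response"] = resp
        if not resp.strip():
            reward -= 0.2
        else:
            reward += 0.2
            upset = ticket["sentiment"] in ("angry", "panicked", "concerned")
            if upset and not any(w in resp.lower() for w in EMPATHY_WORDS):
                reward -= 0.1
    elif action_type == "resolve":
        ready = all(ticket.get(k) for k in ("classification", "priority", "response"))
        if ready:
            reward += 0.4
            if current_step > ticket["sla_steps"]:
                reward -= 0.25   # SLA breached
        else:
            reward -= 0.2        # resolve attempted with missing fields
    elif action_type == "escalate":
        reward += 0.15 if ticket["sentiment"] in ("angry", "panicked") else -0.15
    else:
        reward -= 0.1            # unknown action type
    if action_type in actions_taken:
        reward -= REPEAT_PENALTY
    actions_taken.add(action_type)
    return reward - STEP_COST


def new_ticket() -> Dict:
    # Illustrative ticket; the field names follow the SCENARIOS entries.
    return {"sentiment": "panicked", "expected_classification": "login_issue",
            "expected_priority": "high", "sla_steps": 15}


# Perfect 4-step episode on a single ticket.
ticket, taken = new_ticket(), set()
episode = [
    ("classify_ticket", {"classification": "login_issue"}),
    ("assign_priority", {"priority": "high"}),
    ("generate_response", {"response": "I am so sorry you are locked out."}),
    ("resolve", {}),
]
total = sum(score_action(ticket, a, p, taken) for a, p in episode)

# Resolving first, with nothing set on the ticket, is penalized.
premature = score_action(new_ticket(), "resolve", {}, set())
print(round(total, 2), round(premature, 2))
```

Under these assumptions the perfect episode sums to 0.35 + 0.25 + 0.2 + 0.4 minus four step costs of 0.02, which is 1.12, while a premature `resolve` yields -0.22. These are raw environment rewards; the per-task scores in `tasks.json` are computed separately by the grader and stay in [0.0, 1.0].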

tasks.json CHANGED

@@ -3,110 +3,186 @@
     "id": "task_easy_1",
     "name": "Ticket Classification",
     "difficulty": "EASY",
-    "objective": "Classify the issue correctly using classify_ticket. Choose the right category from: refund, technical_issue, login_issue, general_inquiry, feedback, security.",
-    "description": "The agent reads a customer ticket and must call classify_ticket with the correct category. No other action is required. Score = 1.0 if classification matches, 0.0 otherwise.",
     "actions_required": ["classify_ticket"],
     "scoring": {
       "classification_correct": 1.0,
       "classification_wrong": 0.0
     },
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_easy_2",
-    "name": "Priority Assignment",
     "difficulty": "EASY",
-    "objective": "Assign the correct priority (low / medium / high) based on customer urgency and sentiment.",
-    "description": "The agent reads the ticket sentiment and urgency signals, then calls assign_priority. Score = 1.0 for correct priority, 0.0 otherwise.",
     "actions_required": ["assign_priority"],
     "scoring": {
       "priority_correct": 1.0,
       "priority_wrong": 0.0
     },
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_medium_1",
-    "name": "Classify and Respond",
     "difficulty": "MEDIUM",
-    "objective": "Correctly classify the ticket AND generate an appropriate response. If the customer is angry, the response must include empathy keywords (sorry, apologize, understand, help, concern).",
-    "description": "Two-step task: classify (0.5 pts) + empathetic response (0.5 pts). Angry customers without empathy score 0 on the response component.",
     "actions_required": ["classify_ticket", "generate_response"],
     "scoring": {
       "classification_correct": 0.5,
-      "response_with_empathy": 0.5
     },
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_medium_2",
-    "name": "Professional Resolution",
     "difficulty": "MEDIUM",
-    "objective": "Classify the ticket and draft a professional response containing at least one solution-oriented keyword (help, support, assist, resolve, solution).",
-    "description": "Two-step task: classify (0.5 pts) + professional keyword response (0.5 pts). Tests the agent's ability to generate actionable, not just sympathetic, responses.",
     "actions_required": ["classify_ticket", "generate_response"],
     "scoring": {
       "classification_correct": 0.5,
-      "response_professional": 0.5
     },
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_hard_1",
-    "name": "Full Support Workflow",
     "difficulty": "HARD",
-    "objective": "Complete the full 4-step workflow: (1) classify, (2) assign priority, (3) generate empathetic response, (4) resolve the ticket.",
-    "description": "Each of the 4 steps is worth 0.25. The agent must complete all steps correctly to achieve 1.0. Resolving without classification/priority/response penalizes the resolve step.",
     "actions_required": ["classify_ticket", "assign_priority", "generate_response", "resolve"],
     "scoring": {
       "classification_correct": 0.25,
       "priority_correct": 0.25,
-      "response_with_empathy": 0.25,
-      "ticket_resolved": 0.25
     },
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_hard_2",
-    "name": "High-Priority Angry Customer",
     "difficulty": "HARD",
-    "objective": "Identify an angry/panicked customer, set priority to high, write a reassuring empathetic response, and correctly classify the issue.",
-    "description": "4-component grader: classification (0.25) + high priority (0.25) + empathy in response (0.25) + correct sentiment detection (0.25). Designed for de-escalation training.",
     "actions_required": ["classify_ticket", "assign_priority", "generate_response"],
     "scoring": {
       "classification_correct": 0.25,
-      "priority_high": 0.25,
       "response_empathetic": 0.25,
-      "sentiment_angry_or_panicked": 0.25
     },
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_hard_3",
-    "name": "Efficiency Challenge",
     "difficulty": "HARD",
-    "objective": "Complete the full workflow (classify + priority + respond + resolve) in 4 steps or fewer for a speed bonus. Accuracy on each step still matters.",
-    "description": "5-component grader: classification (0.2) + priority (0.2) + response (0.2) + resolve (0.2) + efficiency bonus (0.2 for ≤4 steps, 0.1 for ≤6 steps). Max score = 1.0.",
     "actions_required": ["classify_ticket", "assign_priority", "generate_response", "resolve"],
     "scoring": {
       "classification_correct": 0.2,
       "priority_correct": 0.2,
-      "response_present": 0.2,
       "ticket_resolved": 0.2,
-      "efficiency_bonus": 0.2
     },
     "has_grader": true,
     "has_evaluator": true,
     "grader": true

     "id": "task_easy_1",
     "name": "Ticket Classification",
     "difficulty": "EASY",
+    "scenario": "A customer writes: 'I was charged twice for my subscription this month. Please refund one payment.' The agent must identify this as a billing/refund issue.",
+    "objective": "Call classify_ticket with the correct category. Categories: refund | technical_issue | login_issue | general_inquiry | feedback | security. Score = 1.0 for correct, 0.0 for wrong.",
+    "description": "Single-action task. The agent reads the ticket text, identifies the issue type from clear signal words (e.g. 'refund', 'charged', 'can't login', 'data breach'), and calls classify_ticket once. No priority or response needed.",
+    "example_input": {
+      "ticket_text": "I was charged twice for my subscription. Please refund one payment.",
+      "sentiment": "angry"
+    },
+    "example_action": {
+      "action_type": "classify_ticket",
+      "payload": {"classification": "refund"}
+    },
     "actions_required": ["classify_ticket"],
     "scoring": {
       "classification_correct": 1.0,
       "classification_wrong": 0.0
     },
+    "passing_threshold": 0.5,
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_easy_2",
+    "name": "Priority Triage",
     "difficulty": "EASY",
+    "scenario": "A panicked user writes: 'I cannot log in and my team demo starts in 5 minutes!' — High urgency requires HIGH priority. A general question like 'How do I export a CSV?' should get LOW priority.",
+    "objective": "Call assign_priority with the correct urgency level (low | medium | high). Sentiment and time-pressure signals in the ticket determine priority. Score = 1.0 for correct, 0.0 otherwise.",
+    "description": "Single-action triage task. The agent reads urgency signals (keywords like 'ASAP', 'urgent', 'presentation', crash reports) and maps them to correct priority. HIGH = emergency/angry/time-sensitive. MEDIUM = frustrated/recurring. LOW = curious/happy/general.",
+    "example_input": {
+      "ticket_text": "I can't log in and my client call starts in 5 minutes!",
+      "sentiment": "panicked"
+    },
+    "example_action": {
+      "action_type": "assign_priority",
+      "payload": {"priority": "high"}
+    },
     "actions_required": ["assign_priority"],
     "scoring": {
       "priority_correct": 1.0,
       "priority_wrong": 0.0
     },
+    "passing_threshold": 0.5,
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_medium_1",
+    "name": "Classify + Empathetic Reply",
     "difficulty": "MEDIUM",
+    "scenario": "An angry customer is frustrated their refund has not arrived after 10 days. The agent must (1) correctly classify as 'refund' and (2) write a response that acknowledges their frustration using empathy words like 'sorry', 'apologize', or 'understand'.",
+    "objective": "Two actions in sequence: classify_ticket correctly (0.5 pts) + generate_response containing at least one empathy keyword for angry customers (0.5 pts). Missing empathy for an angry customer scores 0 on the response component.",
+    "description": "Real-world de-escalation task. An angry customer needs both accurate issue categorization AND a tone-appropriate response. The grader checks: (a) classification matches expected_classification, (b) for angry/panicked sentiment, response must contain empathy words [sorry, apologize, understand, help, concern].",
+    "example_input": {
+      "ticket_text": "My refund was supposed to arrive 10 days ago. This is completely unacceptable!",
+      "sentiment": "angry"
+    },
+    "example_actions": [
+      {"action_type": "classify_ticket", "payload": {"classification": "refund"}},
+      {"action_type": "generate_response", "payload": {"response": "I sincerely apologize for the delay on your refund. I understand how frustrating this must be and I am escalating this to our billing team right now."}}
+    ],
     "actions_required": ["classify_ticket", "generate_response"],
     "scoring": {
       "classification_correct": 0.5,
+      "response_empathetic_for_angry_customer": 0.5
     },
+    "passing_threshold": 0.5,
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_medium_2",
+    "name": "Classify + Actionable Resolution",
     "difficulty": "MEDIUM",
+    "scenario": "A user reports a technical bug: 'The app crashes every time I try to export a PDF.' The agent must (1) classify as 'technical_issue' and (2) provide an actionable response that guides the user toward a solution (using keywords like 'help', 'support', 'resolve', 'fix', 'solution', 'assist').",
+    "objective": "Two actions: classify_ticket correctly (0.5 pts) + generate_response with at least one solution-oriented keyword (0.5 pts). Tests that the agent provides helpful guidance, not just sympathy.",
+    "description": "Actionable response task. Unlike task_medium_1 which checks for empathy, this task checks for solution orientation. The agent must show they can guide users toward resolution, not just acknowledge feelings. Keywords checked: [help, support, assist, resolve, solution, fix, guide, step, instructions].",
+    "example_input": {
+      "ticket_text": "The app crashes when I try to export PDF. This is blocking my work.",
+      "sentiment": "frustrated"
+    },
+    "example_actions": [
+      {"action_type": "classify_ticket", "payload": {"classification": "technical_issue"}},
+      {"action_type": "generate_response", "payload": {"response": "I understand the inconvenience. Please try clearing your app cache and updating to v2.3.1. If the issue persists, our support team will assist you directly with a fix."}}
+    ],
     "actions_required": ["classify_ticket", "generate_response"],
     "scoring": {
       "classification_correct": 0.5,
+      "response_solution_oriented": 0.5
     },
+    "passing_threshold": 0.5,
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_hard_1",
+    "name": "Full Ticket Lifecycle",
     "difficulty": "HARD",
+    "scenario": "A customer reports they cannot access their account after changing their password. The full workflow must be completed: classify the issue, set the right priority, write an empathetic response that offers next steps, and then close the ticket.",
+    "objective": "Complete all 4 lifecycle steps correctly. Each step earns 0.25: (1) classify_ticket correct, (2) assign_priority correct, (3) generate_response with empathy/solution keywords, (4) resolve (ticket must have classification + priority + response before resolving).",
+    "description": "End-to-end lifecycle task. This mirrors a real support agent's complete workflow. The grader is strict: resolve only scores 0.25 if the ticket also has classification, priority, and response set. This prevents agents from skipping steps and jumping straight to resolve.",
+    "example_input": {
+      "ticket_text": "I reset my password but still cannot log in. My entire team is locked out!",
+      "sentiment": "panicked"
+    },
+    "example_actions": [
+      {"action_type": "classify_ticket",   "payload": {"classification": "login_issue"}},
+      {"action_type": "assign_priority",   "payload": {"priority": "high"}},
+      {"action_type": "generate_response", "payload": {"response": "I am so sorry you're locked out. I understand how urgent this is. I am escalating this to our account team immediately — you should be back in within 10 minutes. Please try the 'Forgot Password' link in the meantime."}},
+      {"action_type": "resolve",           "payload": {}}
+    ],
     "actions_required": ["classify_ticket", "assign_priority", "generate_response", "resolve"],
     "scoring": {
       "classification_correct": 0.25,
       "priority_correct": 0.25,
+      "response_empathetic_and_actionable": 0.25,
+      "ticket_properly_resolved": 0.25
     },
+    "passing_threshold": 0.5,
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_hard_2",
+    "name": "Angry Customer De-escalation",
     "difficulty": "HARD",
+    "scenario": "A furious customer threatens to cancel their subscription after being billed incorrectly three months in a row. The agent must correctly classify the ticket as 'refund', set priority to 'high' (angry + financial dispute), and write an empathetic response addressing their anger. The ticket's sentiment must be 'angry' or 'panicked'.",
+    "objective": "4-component score: (1) correct classification (0.25), (2) priority set to 'high' (0.25) — any other priority scores 0, (3) response contains empathy keywords (0.25), (4) ticket sentiment is 'angry' or 'panicked' (0.25) — validates agent correctly identifies escalation scenarios.",
+    "description": "De-escalation specialization task. Real customer support teams have agents who specialize in handling angry customers. This task trains that skill: the agent must recognize the escalation signals, prioritize correctly, AND respond with appropriate emotional intelligence. Assigning low/medium priority to an angry billing complaint is a failure.",
+    "example_input": {
+      "ticket_text": "I've been billed incorrectly for 3 months! I want a full refund and I'm cancelling everything if this isn't fixed TODAY.",
+      "sentiment": "angry"
+    },
+    "example_actions": [
+      {"action_type": "classify_ticket",   "payload": {"classification": "refund"}},
+      {"action_type": "assign_priority",   "payload": {"priority": "high"}},
+      {"action_type": "generate_response", "payload": {"response": "I sincerely apologize for this ongoing billing error — this is completely unacceptable and I understand your frustration. I am immediately processing a full 3-month refund and flagging your account to prevent future errors. A senior account manager will call you within the hour."}}
+    ],
     "actions_required": ["classify_ticket", "assign_priority", "generate_response"],
     "scoring": {
       "classification_correct": 0.25,
+      "priority_must_be_high": 0.25,
       "response_empathetic": 0.25,
+      "sentiment_is_angry_or_panicked": 0.25
     },
+    "passing_threshold": 0.5,
     "has_grader": true,
     "has_evaluator": true,
     "grader": true
   },
   {
     "id": "task_hard_3",
+    "name": "SLA Speed Challenge",
     "difficulty": "HARD",
+    "scenario": "A high-SLA enterprise ticket has arrived — the customer's entire team is blocked and the contract mandates resolution within 5 actions. The agent must complete the full workflow (classify + priority + respond + resolve) accurately AND efficiently. Every extra action wastes SLA budget.",
+    "objective": "5-component score: classification (0.2) + priority (0.2) + response present (0.2) + ticket resolved (0.2) + efficiency bonus: 0.2 for ≤4 steps, 0.1 for ≤6 steps, 0.0 for >6 steps. Maximum achievable score = 1.0.",
+    "description": "Speed + accuracy combined task. A perfect agent scores 1.0 by doing exactly: classify → priority → respond → resolve (4 steps = maximum efficiency bonus). Extra actions (repeating classify, unnecessary escalations) drain the efficiency score. This tests an agent's ability to plan ahead, not just react to each observation.",
+    "example_input": {
+      "ticket_text": "Our entire development team cannot access the API. We have a production deployment in 2 hours.",
+      "sentiment": "panicked"
+    },
+    "example_actions": [
+      {"action_type": "classify_ticket",   "payload": {"classification": "technical_issue"}},
+      {"action_type": "assign_priority",   "payload": {"priority": "high"}},
+      {"action_type": "generate_response", "payload": {"response": "This is our highest priority. Our on-call engineering team has been paged and will resolve your API access within 30 minutes. We will keep you updated every 10 minutes."}},
+      {"action_type": "resolve",           "payload": {}}
+    ],
     "actions_required": ["classify_ticket", "assign_priority", "generate_response", "resolve"],
     "scoring": {
       "classification_correct": 0.2,
       "priority_correct": 0.2,
+      "response_present_and_meaningful": 0.2,
       "ticket_resolved": 0.2,
+      "efficiency_bonus_4_steps": 0.2,
+      "efficiency_partial_6_steps": 0.1
     },
+    "passing_threshold": 0.5,
     "has_grader": true,
     "has_evaluator": true,
     "grader": true