---
title: Emotional Support Conversations (OpenEnv)
emoji: "💬"
sdk: docker
pinned: false
tags:
  - openenv
---

# Emotional Support Conversations - OpenEnv Environment

> An OpenEnv RL environment for evaluating agents on open-ended emotional
> support conversations, with a hybrid immediate + future-oriented reward
> signal inspired by RLFF-ESC (Yang, Chen, Wang, 2025,
> [arXiv:2508.12935](https://arxiv.org/abs/2508.12935)).

## Why this environment

Emotional support is one of the tasks humans most want AI assistants to do well, and one of the easiest to do badly. Existing dialogue benchmarks often score turn-level responses in isolation, which rewards agents for sounding empathetic without testing whether their replies actually move the person toward resolution. This environment closes that gap.

Three properties make it a genuine RL problem, not a single-shot dialogue task:

1. Partial observability. The seeker's distress, trust, and willingness to reveal their real issue are hidden state. The agent must infer them from the conversation so far.
2. Sequential credit assignment. A warm reply at turn 2 can unlock a disclosure at turn 6. A single dismissive reply at turn 4 can collapse the whole trajectory and require several turns to recover.
3. Exploration vs commitment. Should the agent keep exploring feelings or move toward an action plan? Commit too early and the seeker shuts down; explore too long and the episode times out.

## Reward design (RLFF-ESC-inspired)

Each step reward is:

```text
step_reward = clip(0.45 * immediate + 0.55 * future_oriented - penalties, 0, 1)
```

- `immediate`: stage-appropriate empathy/validation/open-question fit, plus turn-level deltas in the seeker's trust and distress.
- `future_oriented`: a k-step oracle rollout from both the pre- and post-action seeker states. The reward is proportional to how much the agent's action preserves or advances the attainable resolution ceiling, not just how good the current turn looks in isolation.
- `penalties`: dismissive language, premature advice, bare replies, interrogation, and repeated template-like responses.

A final task score combines average shaped reward, the seeker's final resolution state, efficiency, and a completion bonus. Success is hard-gated: timing out with a generic but non-harmful conversation can still earn partial score, but it does not count as a solved episode.

## Tasks (3 difficulties)

| Task ID | Difficulty | Max turns | Core challenge |
| --- | --- | ---: | --- |
| `work_stress_venting` | easy | 10 | Cooperative seeker venting about work. Must reach closing with trust >= 0.70 and distress <= 0.40. |
| `guarded_relationship` | medium | 12 | Guarded seeker; real issue is hidden behind the surface concern until openness >= 0.75. Must reveal the true issue and finish in closing with trust >= 0.72 and distress <= 0.45. |
| `crisis_fragile_trust` | hard | 14 | High-distress, fragile trust, multiple interleaved concerns. Must reveal the crisis concern, reference external safety support, and finish in closing with trust >= 0.75 and distress <= 0.40. |

Success thresholds (final score) are `0.60 / 0.62 / 0.65` respectively, and they are only evaluated after the task-specific completion conditions are met.

## Action and observation space

Action is a free-text reply to the seeker:

```python
class Action(BaseModel):
    message: str
```

Observation is deliberately partial:

```python
class Observation(BaseModel):
    seeker_utterance: str
    turn: int
    remaining_turns: int
    stage_hint: str
    task_id: str
    scenario_brief: str
```

The seeker's internal hidden variables are never exposed.

## Environment internals

The seeker is a deterministic finite-state machine with continuous hidden variables (`distress`, `trust`, `openness`, `revealed`, `stage`). On each turn, the agent's reply is analyzed with keyword and regex feature detectors, then hidden state advances via transparent rules.

Why not use an LLM-driven seeker?
The hackathon rubric requires graders to be deterministic and reproducible. An LLM-driven seeker would risk score variance between runs. Deterministic dynamics give full reproducibility while still producing rich, sequential, partially observable dialogue with genuine recovery-from-mistakes dynamics.

## HTTP API (OpenEnv spec)

| Method | Path | Body | Returns |
| --- | --- | --- | --- |
| `GET` | `/` | none | health + metadata |
| `GET` | `/tasks` | none | list of tasks |
| `POST` | `/reset` | `{"task_id": "...", "seed": null}` | `ResetResult` |
| `POST` | `/step` | `{"action": {"message": "..."}}` | `StepResult` |
| `GET` | `/state` | none | `EnvState` |

## Running locally

```bash
# 1. Install deps
pip install -r requirements.txt

# 2. Start the environment server
uvicorn server:app --host 0.0.0.0 --port 7860

# 3. In another shell, run the baseline inference
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=
export ESC_ENV_URL=http://127.0.0.1:7860
python3 inference.py
```

`inference.py` uses the OpenAI client and expects `API_BASE_URL` plus `MODEL_NAME`. For authentication it accepts `HF_TOKEN` (preferred for Hugging Face Router), `OPENAI_API_KEY`, or `API_KEY`.

## Running via Docker

```bash
docker build -t esc-openenv .
docker run -p 7860:7860 esc-openenv
```

## Skills / agents extension

The environment itself stays deterministic and reproducible. To align with the hackathon's optional skills/agents framing, this repo also includes a policy-side agentic controller that routes between five reusable skills: `empathize`, `validate`, `explore`, `plan`, and `safety_escalate`.
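As a rough illustration of what routing between those five skills could look like, here is a minimal sketch. It is not the controller in `src/agentic.py`: the `Obs` dataclass, the `route` function, the cue phrases, the turn thresholds, and the `"exploring"` stage value are all hypothetical, and it assumes `stage_hint` can take the value `"closing"`. It reads only fields that the observation exposes.

```python
from dataclasses import dataclass

# The five skill names come from the README; everything else is hypothetical.
SKILLS = ("empathize", "validate", "explore", "plan", "safety_escalate")

# Illustrative cue phrases only; not the repo's actual detectors.
CRISIS_CUES = ("no way out", "hurt myself", "can't go on")


@dataclass
class Obs:
    """Subset of the Observation fields that this sketch reads."""
    seeker_utterance: str
    turn: int
    remaining_turns: int
    stage_hint: str


def route(obs: Obs) -> str:
    """Pick one skill name from observable fields only (no hidden seeker state)."""
    text = obs.seeker_utterance.lower()
    # Safety takes priority over every other consideration.
    if any(cue in text for cue in CRISIS_CUES):
        return "safety_escalate"
    # Assumption: stage_hint can be "closing"; also plan when turns run low.
    if obs.stage_hint == "closing" or obs.remaining_turns <= 2:
        return "plan"
    # Open with empathy before asking anything.
    if obs.turn <= 1:
        return "empathize"
    # Alternate validation and exploration to avoid back-to-back questions.
    return "validate" if obs.turn % 2 == 0 else "explore"


# Route the first four turns of a calm, hypothetical conversation.
trace = [route(Obs("Work has been brutal lately.", t, 10 - t, "exploring"))
         for t in range(4)]
print(trace)
```

Because the sketch is a pure function of the observation, the same conversation always yields the same routing trace, so the controller adds no randomness on the policy side.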
This keeps the benchmark honest:

- the environment and grader remain unchanged
- the agentic story lives in the policy, not in a hidden stochastic seeker
- judges can inspect turn-by-turn routing traces in the benchmark outputs

## Benchmarking

### Deterministic local benchmark ladder

Run the built-in rubric ladder and write reusable Markdown/JSON artifacts:

```bash
py -3 benchmark.py
```

Outputs:

- `results/local_benchmarks.md`
- `results/local_benchmarks.json`

### Deterministic skill-routed benchmark

Run the explicit agentic baseline comparison and write route-aware artifacts:

```bash
py -3 benchmark_agentic.py
```

Outputs:

- `results/agentic_benchmarks.md`
- `results/agentic_benchmarks.json`

### LLM benchmark with Markdown output

When you have a real model endpoint and token, run:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_llm.py
```

Outputs:

- `results/llm_benchmark.md`
- `results/llm_benchmark.json`

### Skill-routed LLM benchmark

Use the same environment endpoint, but add the policy-side router and skill traces around the model:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_agentic_llm.py
```

Outputs:

- `results/agentic_llm_benchmark.md`
- `results/agentic_llm_benchmark.json`

## Baseline scores

Deterministic local numbers below were generated with `py -3 benchmark.py`. The submitted hosted baseline below comes from a live `inference.py` run against the deployed Hugging Face Space using `gpt-4.1-mini`.
### Deterministic baselines

| Baseline | Avg score | Success rate | Notes |
| --- | ---: | ---: | --- |
| `generic_template` | 0.393 | 0.00 | Safe-sounding repeated empathy; no task completion |
| `validation_only` | 0.539 | 0.00 | Better partial reward, still fails hard-gated completion |
| `stage_aware_heuristic` | 0.821 | 1.00 | Task-aware staged policy; completes all 3 tasks |

### Skill-routed agentic baselines

| Baseline | Avg score | Success rate | Notes |
| --- | ---: | ---: | --- |
| `skill_routed_deterministic` | 0.821 | 1.00 | Explicit router over `empathize` / `validate` / `explore` / `plan` / `safety_escalate`; matches the strong staged baseline while exposing route traces |

### Submitted hosted LLM baseline

| Model | Avg score | Success rate | Notes |
| --- | ---: | ---: | --- |
| `gpt-4.1-mini` | 0.821 | 1.00 | Live `inference.py` run against [`5ivatej-meta-hackathon.hf.space`](https://5ivatej-meta-hackathon.hf.space) |

The deterministic ladder separates surface-level empathy from task completion: the generic repeated-empathy template does not solve any task, while the stage-aware heuristic completes all three. The submitted `gpt-4.1-mini` baseline also completes all three tasks because the policy-side controller keeps the conversation stage-aware instead of drifting into endless reflection.

## Files

```text
.
|-- openenv.yaml                # OpenEnv metadata
|-- Dockerfile                  # Container build for HF Space
|-- benchmark.py                # Deterministic local benchmark ladder
|-- benchmark_agentic.py        # Deterministic skill-routed benchmark
|-- benchmark_agentic_llm.py    # Skill-routed LLM benchmark
|-- benchmark_llm.py            # LLM benchmark that writes Markdown/JSON
|-- requirements.txt
|-- server.py                   # FastAPI HTTP server (entrypoint)
|-- inference.py                # Mandated baseline inference script
|-- SUBMISSION_NEXT_STEPS.md    # Manual checklist before final submission
|-- README.md
`-- src/
    |-- __init__.py
    |-- agentic.py              # Skill router + reusable policy-side skills
    |-- baselines.py            # Deterministic baseline policies
    |-- models.py               # Pydantic Action / Observation / Reward / envelopes
    |-- seeker.py               # Deterministic seeker simulator + feature detectors
    |-- tasks.py                # 3 task personas (easy / medium / hard)
    |-- grader.py               # Hybrid immediate + future-oriented reward
    |-- env.py                  # Core ESCEnv with step/reset/state
    `-- client.py               # Async HTTP client for inference.py
```

## Citation

If you use this environment, please cite the paper whose reward idea inspired it:

```bibtex
@article{yang2025rlffesc,
  title   = {Towards Open-Ended Emotional Support Conversations in LLMs via Reinforcement Learning with Future-Oriented Rewards},
  author  = {Yang, Ting and Chen, Li and Wang, Huimin},
  journal = {arXiv preprint arXiv:2508.12935},
  year    = {2025}
}
```