---
title: Emotional Support Conversations (OpenEnv)
emoji: "💬"
sdk: docker
pinned: false
tags:
  - openenv
---

# Emotional Support Conversations - OpenEnv Environment

> An OpenEnv RL environment for evaluating agents on open-ended emotional
> support conversations, with a hybrid immediate + future-oriented reward
> signal inspired by RLFF-ESC (Yang, Chen, and Wang, 2025;
> [arXiv:2508.12935](https://arxiv.org/abs/2508.12935)).

## Why this environment

Emotional support is one of the tasks humans most want AI assistants to do
well, and one of the easiest to do badly. Existing dialogue benchmarks often
score turn-level responses in isolation, which rewards agents for sounding
empathetic without testing whether their replies actually move the person
toward resolution. This environment closes that gap.

Three properties make it a genuine RL problem, not a single-shot dialogue
task:

1. Partial observability. The seeker's distress, trust, and willingness to
   reveal their real issue are hidden state. The agent must infer them from
   the conversation so far.
2. Sequential credit assignment. A warm reply at turn 2 can unlock a
   disclosure at turn 6. A single dismissive reply at turn 4 can collapse the
   whole trajectory and require several turns to recover.
3. Exploration vs. commitment. Should the agent keep exploring feelings or move
   toward an action plan? Commit too early and the seeker shuts down; explore
   too long and the episode times out.

## Reward design (RLFF-ESC-inspired)

Each step reward is:

```text
step_reward = clip(0.45 * immediate + 0.55 * future_oriented - penalties, 0, 1)
```

- `immediate`: stage-appropriate empathy/validation/open-question fit, plus
  turn-level deltas in the seeker's trust and distress.
- `future_oriented`: a k-step oracle rollout from both the pre- and
  post-action seeker states. The reward is proportional to how much the
  agent's action preserves or advances the attainable resolution ceiling, not
  just how good the current turn looks in isolation.
- `penalties`: dismissive language, premature advice, bare replies,
  interrogation, and repeated template-like responses.
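
The combination step can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the code in `src/grader.py`: the function names and the example component values are assumptions, and it assumes the three components have already been computed as floats.

```python
def clip(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def step_reward(immediate: float, future_oriented: float, penalties: float) -> float:
    """Blend both components with the README weights, subtract penalties, clip."""
    return clip(0.45 * immediate + 0.55 * future_oriented - penalties)


# A stage-appropriate reply with no penalties (illustrative values).
good = step_reward(immediate=0.8, future_oriented=0.6, penalties=0.0)

# The same reply with a hypothetical 0.3 penalty, e.g. for premature advice.
penalized = step_reward(immediate=0.8, future_oriented=0.6, penalties=0.3)

# Heavy penalties cannot push the reward below zero because of the clip.
floored = step_reward(immediate=0.2, future_oriented=0.1, penalties=0.9)
```

Because `future_oriented` carries the larger weight, a reply that looks good in isolation but lowers the attainable resolution ceiling scores worse than its turn-level quality alone would suggest.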

A final task score combines the average shaped reward, the seeker's final
resolution state, efficiency, and a completion bonus. Success is hard-gated:
an episode that times out after a generic but non-harmful conversation can
still earn a partial score, but it does not count as solved.

## Tasks (3 difficulties)

| Task ID | Difficulty | Max turns | Core challenge |
| --- | --- | ---: | --- |
| `work_stress_venting` | easy | 10 | Cooperative seeker venting about work. Must reach closing with trust >= 0.70 and distress <= 0.40. |
| `guarded_relationship` | medium | 12 | Guarded seeker; real issue is hidden behind the surface concern until openness >= 0.75. Must reveal the true issue and finish in closing with trust >= 0.72 and distress <= 0.45. |
| `crisis_fragile_trust` | hard | 14 | High-distress, fragile trust, multiple interleaved concerns. Must reveal the crisis concern, reference external safety support, and finish in closing with trust >= 0.75 and distress <= 0.40. |

Success thresholds on the final score are `0.60 / 0.62 / 0.65` for the easy /
medium / hard tasks respectively, and they are evaluated only after the
task-specific completion conditions are met.
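
As a rough sketch of that gating, the check below applies the completion conditions first and the score threshold second. The numbers come from the table above; the `FinalState` class, its field names, and the use of `>=` for the score comparison are assumptions made for illustration, not the grader's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalState:
    """Hypothetical snapshot of the seeker when the episode ends."""
    stage: str
    trust: float
    distress: float
    revealed: bool = False           # true issue / crisis concern disclosed
    safety_referenced: bool = False  # agent pointed to external safety support


# Limits and score thresholds copied from the task table.
TASKS = {
    "work_stress_venting": {"score": 0.60, "trust": 0.70, "distress": 0.40},
    "guarded_relationship": {"score": 0.62, "trust": 0.72, "distress": 0.45,
                             "needs_reveal": True},
    "crisis_fragile_trust": {"score": 0.65, "trust": 0.75, "distress": 0.40,
                             "needs_reveal": True, "needs_safety": True},
}


def completion_met(task_id: str, state: FinalState) -> bool:
    """Check the task-specific completion conditions."""
    spec = TASKS[task_id]
    if state.stage != "closing":
        return False
    if state.trust < spec["trust"] or state.distress > spec["distress"]:
        return False
    if spec.get("needs_reveal") and not state.revealed:
        return False
    if spec.get("needs_safety") and not state.safety_referenced:
        return False
    return True


def is_success(task_id: str, final_score: float, state: FinalState) -> bool:
    """Hard gate: the score threshold only matters once completion is met."""
    return completion_met(task_id, state) and final_score >= TASKS[task_id]["score"]
```

Under this sketch, a high-scoring episode that never reaches the `closing` stage is still reported as unsolved, which is the behaviour the hard gate is meant to enforce.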

## Action and observation space

Action is a free-text reply to the seeker:

```python
class Action(BaseModel):
    message: str
```

Observation is deliberately partial:

```python
class Observation(BaseModel):
    seeker_utterance: str
    turn: int
    remaining_turns: int
    stage_hint: str
    task_id: str
    scenario_brief: str
```

The seeker's internal hidden variables are never exposed in the observation.

## Environment internals

The seeker is a deterministic finite-state machine with continuous hidden
variables (`distress`, `trust`, `openness`, `revealed`, `stage`). On each
turn, the agent's reply is analyzed with keyword and regex feature detectors,
then hidden state advances via transparent rules.
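
To make the idea concrete, here is a toy version of that loop: regex detectors extract features from the reply, and fixed rules update the hidden state. The patterns, the deltas, and the `SeekerState` fields shown are invented for illustration; the real detectors and rules live in `src/seeker.py` and may differ substantially.

```python
import re
from dataclasses import dataclass

# Hypothetical detectors, not the patterns used by the repository.
DETECTORS = {
    "empathy": re.compile(r"\b(that sounds|i'm sorry|i hear you)\b", re.IGNORECASE),
    "open_question": re.compile(r"\b(what|how)\b[^.?!]*\?", re.IGNORECASE),
    "dismissive": re.compile(r"\b(just relax|calm down|get over it)\b", re.IGNORECASE),
}


@dataclass
class SeekerState:
    distress: float
    trust: float
    openness: float


def detect(message: str) -> set:
    """Return the names of all detectors that fire on the agent's reply."""
    return {name for name, pattern in DETECTORS.items() if pattern.search(message)}


def clamp(value: float) -> float:
    return max(0.0, min(1.0, value))


def advance(state: SeekerState, message: str) -> SeekerState:
    """Apply fixed, rule-based updates; the same input always gives the same output."""
    features = detect(message)
    distress, trust, openness = state.distress, state.trust, state.openness
    if "dismissive" in features:
        # One dismissive reply costs more trust than one warm reply earns.
        trust -= 0.20
        distress += 0.10
    else:
        if "empathy" in features:
            trust += 0.08
            distress -= 0.05
        if "open_question" in features:
            openness += 0.10
    return SeekerState(clamp(distress), clamp(trust), clamp(openness))


start = SeekerState(distress=0.7, trust=0.4, openness=0.3)
warm = advance(start, "That sounds exhausting. What has been the hardest part?")
cold = advance(start, "You should just relax.")
```

The asymmetric deltas in this sketch mirror the recovery-from-mistakes dynamic described earlier: a single dismissive reply undoes more than one supportive turn.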

Why not use an LLM-driven seeker? The hackathon rubric requires graders to be
deterministic and reproducible. An LLM-driven seeker would risk score variance
between runs. Deterministic dynamics give full reproducibility while still
producing rich, sequential, partially observable dialogue with genuine
recovery-from-mistakes dynamics.

## HTTP API (OpenEnv spec)

| Method | Path | Body | Returns |
| --- | --- | --- | --- |
| `GET` | `/` | none | health + metadata |
| `GET` | `/tasks` | none | list of tasks |
| `POST` | `/reset` | `{"task_id": "...", "seed": null}` | `ResetResult` |
| `POST` | `/step` | `{"action": {"message": "..."}}` | `StepResult` |
| `GET` | `/state` | none | `EnvState` |
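
A minimal standard-library sketch of calling these endpoints follows. The paths and request bodies are taken from the table; the helper names, the base URL (the local `uvicorn` address used later in this README), and the example message are assumptions. The shapes of `ResetResult` and `StepResult` are not shown here, so the sketch returns the parsed JSON without interpreting it.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:7860"  # assumed local server address


def build_request(method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build a JSON request for one of the environment endpoints."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def call(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = build_request(
    "POST", "/reset", {"task_id": "work_stress_venting", "seed": None}
)
step_request = build_request(
    "POST", "/step",
    {"action": {"message": "That sounds really draining. What happened today?"}},
)
state_request = build_request("GET", "/state")

# With the server running, an episode would start like this:
# reset_result = call(reset_request)
# step_result = call(step_request)
```

Note that `"seed": null` in the table corresponds to `None` on the Python side.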

## Running locally

```bash
# 1. Install deps
pip install -r requirements.txt

# 2. Start the environment server
uvicorn server:app --host 0.0.0.0 --port 7860

# 3. In another shell, run the baseline inference
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 inference.py
```

`inference.py` uses the OpenAI client and expects `API_BASE_URL` plus
`MODEL_NAME`. For authentication it accepts `HF_TOKEN` (preferred for Hugging
Face Router), `OPENAI_API_KEY`, or `API_KEY`.
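
The credential lookup can be pictured as a simple fallback chain. The sketch below is an assumption about how such a lookup might work, not a copy of `inference.py`: `HF_TOKEN` comes first because the text calls it preferred, while the relative order of `OPENAI_API_KEY` and `API_KEY` is a guess.

```python
import os
from typing import Mapping, Optional

# HF_TOKEN first (stated as preferred); the order of the other two is assumed.
TOKEN_VARS = ("HF_TOKEN", "OPENAI_API_KEY", "API_KEY")


def resolve_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty credential found, or None if none is set."""
    for name in TOKEN_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary.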

## Running via Docker

```bash
docker build -t esc-openenv .
docker run -p 7860:7860 esc-openenv
```

## Skills / agents extension

The environment itself stays deterministic and reproducible. To align with the
hackathon's optional skills/agents framing, this repo also includes a
policy-side agentic controller that routes between five reusable skills:
`empathize`, `validate`, `explore`, `plan`, and `safety_escalate`.

This keeps the benchmark honest:

- the environment and grader remain unchanged
- the agentic story lives in the policy, not in a hidden stochastic seeker
- judges can inspect turn-by-turn routing traces in the benchmark outputs
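
A policy-side router of this kind might look roughly like the sketch below. Only the five skill names and the `closing` stage come from this README; the crisis cues, the other stage names, the turn cutoff, and the `(skill, reason)` return shape are hypothetical, and the actual logic in `src/agentic.py` may be quite different.

```python
from typing import Tuple

SKILLS = ("empathize", "validate", "explore", "plan", "safety_escalate")

# Hypothetical cues; a real router would need a far more careful safety check.
CRISIS_CUES = ("hurt myself", "can't go on", "no way out")


def route(seeker_utterance: str, stage_hint: str,
          remaining_turns: int) -> Tuple[str, str]:
    """Pick one skill from observable fields only and return it with a reason."""
    text = seeker_utterance.lower()
    if any(cue in text for cue in CRISIS_CUES):
        return "safety_escalate", "crisis cue in seeker utterance"
    if stage_hint == "closing" or remaining_turns <= 2:
        return "plan", "closing stage or turn budget nearly spent"
    if stage_hint == "opening":      # hypothetical stage name
        return "empathize", "early turn: build trust first"
    if stage_hint == "exploring":    # hypothetical stage name
        return "explore", "invite the seeker to say more"
    return "validate", "default when no other rule applies"


# Each decision doubles as a trace entry that can be logged per turn.
trace = [
    route("Work has been awful lately.", "opening", 10),
    route("My manager keeps piling on tasks.", "exploring", 7),
    route("I guess talking it through helps.", "closing", 3),
]
```

The router reads only fields that appear in `Observation`, so it adds structure to the policy without touching the environment or the grader.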

## Benchmarking

### Deterministic local benchmark ladder

Run the built-in rubric ladder and write reusable Markdown/JSON artifacts:

```bash
# Windows launcher shown; on Linux/macOS use: python3 benchmark.py
py -3 benchmark.py
```

Outputs:

- `results/local_benchmarks.md`
- `results/local_benchmarks.json`

### Deterministic skill-routed benchmark

Run the explicit agentic baseline comparison and write route-aware artifacts:

```bash
# Windows launcher shown; on Linux/macOS use: python3 benchmark_agentic.py
py -3 benchmark_agentic.py
```

Outputs:

- `results/agentic_benchmarks.md`
- `results/agentic_benchmarks.json`

### LLM benchmark with Markdown output

When you have a real model endpoint and token, run:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_llm.py
```

Outputs:

- `results/llm_benchmark.md`
- `results/llm_benchmark.json`

### Skill-routed LLM benchmark

Use the same environment endpoint, but add the policy-side router and skill
traces around the model:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4.1-mini
export HF_TOKEN=<your-hf-token>
export ESC_ENV_URL=http://127.0.0.1:7860
python3 benchmark_agentic_llm.py
```

Outputs:

- `results/agentic_llm_benchmark.md`
- `results/agentic_llm_benchmark.json`

## Baseline scores

Deterministic local numbers below were generated with `py -3 benchmark.py`.
The submitted hosted baseline below comes from a live `inference.py` run
against the deployed Hugging Face Space using `gpt-4.1-mini`.

### Deterministic baselines

| Baseline | Avg score | Success rate | Notes |
| --- | ---: | ---: | --- |
| `generic_template` | 0.393 | 0.00 | Safe-sounding repeated empathy; no task completion |
| `validation_only` | 0.539 | 0.00 | Better partial reward, still fails hard-gated completion |
| `stage_aware_heuristic` | 0.821 | 1.00 | Task-aware staged policy; completes all 3 tasks |

### Skill-routed agentic baselines

| Baseline | Avg score | Success rate | Notes |
| --- | ---: | ---: | --- |
| `skill_routed_deterministic` | 0.821 | 1.00 | Explicit router over `empathize` / `validate` / `explore` / `plan` / `safety_escalate`; matches the strong staged baseline while exposing route traces |

### Submitted hosted LLM baseline

| Model | Avg score | Success rate | Notes |
| --- | ---: | ---: | --- |
| `gpt-4.1-mini` | 0.821 | 1.00 | Live `inference.py` run against [`5ivatej-meta-hackathon.hf.space`](https://5ivatej-meta-hackathon.hf.space) |

The deterministic ladder separates surface-level empathy from task completion:
the generic repeated-empathy template does not solve any task, while the
stage-aware heuristic completes all three. The submitted `gpt-4.1-mini`
baseline also completes all three tasks because the policy-side controller
keeps the conversation stage-aware instead of drifting into endless reflection.

## Files

```text
.
|-- openenv.yaml             # OpenEnv metadata
|-- Dockerfile               # Container build for HF Space
|-- benchmark.py             # Deterministic local benchmark ladder
|-- benchmark_agentic.py     # Deterministic skill-routed benchmark
|-- benchmark_agentic_llm.py # Skill-routed LLM benchmark
|-- benchmark_llm.py         # LLM benchmark that writes Markdown/JSON
|-- requirements.txt
|-- server.py                # FastAPI HTTP server (entrypoint)
|-- inference.py             # Mandated baseline inference script
|-- SUBMISSION_NEXT_STEPS.md # Manual checklist before final submission
|-- README.md
`-- src/
    |-- __init__.py
    |-- agentic.py           # Skill router + reusable policy-side skills
    |-- baselines.py         # Deterministic baseline policies
    |-- models.py            # Pydantic Action / Observation / Reward / envelopes
    |-- seeker.py            # Deterministic seeker simulator + feature detectors
    |-- tasks.py             # 3 task personas (easy / medium / hard)
    |-- grader.py            # Hybrid immediate + future-oriented reward
    |-- env.py               # Core ESCEnv with step/reset/state
    `-- client.py            # Async HTTP client for inference.py
```

## Citation

If you use this environment, please cite the paper whose reward idea inspired
it:

```bibtex
@article{yang2025rlffesc,
  title   = {Towards Open-Ended Emotional Support Conversations in LLMs via
             Reinforcement Learning with Future-Oriented Rewards},
  author  = {Yang, Ting and Chen, Li and Wang, Huimin},
  journal = {arXiv preprint arXiv:2508.12935},
  year    = {2025}
}
```