Commit ·
cd688d7
0
Parent(s):
Initial Commit
Browse files- .env.example +3 -0
- .github/workflows/openenv-validation.yml +37 -0
- .gitignore +18 -0
- Dockerfile +18 -0
- PRD.md +25 -0
- README.md +41 -0
- inference.py +127 -0
- openenv.yaml +29 -0
- pyproject.toml +21 -0
- requirements.txt +5 -0
- server/app.py +53 -0
.env.example
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
API_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
|
| 2 |
+
MODEL_NAME="gemini-2.5-flash"
|
| 3 |
+
OPENAI_API_KEY="YOUR_KEY_HERE"
|
.github/workflows/openenv-validation.yml
ADDED
|
@@ -0,0 +1,37 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: OpenEnv Validation CI
|
| 2 |
+
|
| 3 |
+
on:
|
| 4 |
+
push:
|
| 5 |
+
branches: [ "main", "master" ]
|
| 6 |
+
pull_request:
|
| 7 |
+
branches: [ "main", "master" ]
|
| 8 |
+
|
| 9 |
+
jobs:
|
| 10 |
+
validate:
|
| 11 |
+
runs-on: ubuntu-latest
|
| 12 |
+
|
| 13 |
+
steps:
|
| 14 |
+
- name: Checkout Repository
|
| 15 |
+
uses: actions/checkout@v4
|
| 16 |
+
|
| 17 |
+
- name: Set up Python
|
| 18 |
+
uses: actions/setup-python@v4
|
| 19 |
+
with:
|
| 20 |
+
python-version: '3.11'
|
| 21 |
+
|
| 22 |
+
- name: Install dependencies and uv
|
| 23 |
+
run: |
|
| 24 |
+
python -m pip install --upgrade pip
|
| 25 |
+
pip install uv
|
| 26 |
+
pip install openenv-core>=0.2.0
|
| 27 |
+
|
| 28 |
+
- name: Lock dependencies
|
| 29 |
+
run: uv lock
|
| 30 |
+
|
| 31 |
+
- name: Run OpenEnv Validate
|
| 32 |
+
run: |
|
| 33 |
+
openenv validate .
|
| 34 |
+
|
| 35 |
+
- name: Verify Docker Builds
|
| 36 |
+
run: |
|
| 37 |
+
docker build -t test-openenv .
|
.gitignore
ADDED
|
@@ -0,0 +1,18 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Virtual Environments
|
| 2 |
+
.venv/
|
| 3 |
+
venv/
|
| 4 |
+
env/
|
| 5 |
+
|
| 6 |
+
# Python caching
|
| 7 |
+
__pycache__/
|
| 8 |
+
*.pyc
|
| 9 |
+
.pytest_cache/
|
| 10 |
+
|
| 11 |
+
# Environment Variables
|
| 12 |
+
.env
|
| 13 |
+
|
| 14 |
+
# MacOS
|
| 15 |
+
.DS_Store
|
| 16 |
+
|
| 17 |
+
# Tool outputs (uv)
|
| 18 |
+
uv.lock
|
Dockerfile
ADDED
|
@@ -0,0 +1,18 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
FROM python:3.11-slim
|
| 2 |
+
|
| 3 |
+
WORKDIR /app
|
| 4 |
+
|
| 5 |
+
# Install dependencies directly to be lightweight
|
| 6 |
+
RUN pip install --no-cache-dir pydantic openai fastapi uvicorn
|
| 7 |
+
|
| 8 |
+
# Copy project files
|
| 9 |
+
COPY . .
|
| 10 |
+
|
| 11 |
+
# Set default env vars
|
| 12 |
+
ENV PYTHONUNBUFFERED=1
|
| 13 |
+
|
| 14 |
+
# Expose HF Spaces port
|
| 15 |
+
EXPOSE 7860
|
| 16 |
+
|
| 17 |
+
# Run the FastAPI server by default
|
| 18 |
+
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
|
PRD.md
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Product Requirements Document (PRD): Support Ticket Environment for OpenEnv
|
| 2 |
+
|
| 3 |
+
## 1. Introduction and Objectives
|
| 4 |
+
The **Support Ticket Environment** aims to test Large Language Models (LLMs) and agentic frameworks in a highly realistic, consequence-driven enterprise setting. Customer support resolution requires strict adherence to internal policies, information verification, and multi-step reasoning before taking terminal actions (e.g., refunds or escalations).
|
| 5 |
+
|
| 6 |
+
**Objective**: Provide an OpenEnv-compliant simulation where an agent assumes the role of a support professional. The environment acts as an adversarial and deterministic evaluator to cleanly quantify an agent's ability to gather state, read contextual rules, and execute appropriate API actions.
|
| 7 |
+
|
| 8 |
+
## 2. Real-World Utility
|
| 9 |
+
Most AI evaluations focus on static benchmarks (MMLU) or gamified environments (Minecraft). However, the most immediate commercial application of agentic AI is customer support automation.
|
| 10 |
+
* **The Problem**: Companies lose millions to unchecked LLM agents hallucinating policies, issuing improper refunds, or frustrating high-tier enterprise clients.
|
| 11 |
+
* **The Solution**: This environment models the actual complexity of a ticketing system. It enforces that agents must securely verify `UserData`, correctly attribute `IssueType` to a `Policy`, and avoid taking destructive actions (like rejecting an enterprise client abruptly) under pressure or when faced with confusing queries.
|
| 12 |
+
|
| 13 |
+
## 3. Environment Architecture
|
| 14 |
+
- **State Boundaries**: Each task begins with a newly opened ticket. The episode terminates either when the agent explicitly uses a terminal action (`close_ticket`, `escalate`) or after reaching the hard threshold of $N=10$ steps.
|
| 15 |
+
- **Action Constraints**: Intermediate actions (`fetch_user_data`, `check_policy`) do not alter the external ticket state but provide critical context. Terminal actions irreversibly mutate the state and trigger evaluation.
|
| 16 |
+
- **Grading and Reward Shaping**:
|
| 17 |
+
- Graders are strictly deterministic.
|
| 18 |
+
- Fractional rewards are yielded for necessary intermediate contextualization steps (promoting chain-of-thought grounding).
|
| 19 |
+
- Sharp penalties are applied for protocol violations (e.g., escalating a simple refund directly to billing Tier 2).
|
| 20 |
+
|
| 21 |
+
## 4. Required Agent Capabilities
|
| 22 |
+
To succeed on hard tasks, an agent must demonstrate:
|
| 23 |
+
- **State Management**: Remembering the constraints of the `policy` retrieved earlier in the episode.
|
| 24 |
+
- **Self-Correction**: Adapting if `fetch_user_data` returns constraints (e.g., the user is not a premium member).
|
| 25 |
+
- **Nuanced Execution**: Apologizing organically when generating the `reply_to_customer` response during a high-stakes failure ticket.
|
README.md
ADDED
|
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# OpenEnv: Support Ticket Resolution System
|
| 2 |
+
|
| 3 |
+
An OpenEnv standards-compliant simulated customer support environment. The agent takes the role of a support professional and resolves tickets using realistic multi-step processes such as verifying users, checking policies, and issuing actions (refunds, escalations, replies).
|
| 4 |
+
|
| 5 |
+
## Motivation & Real-world Relevance
|
| 6 |
+
*Please see our detailed [Product Requirements Document (PRD.md)](./PRD.md) for full breakdown.*
|
| 7 |
+
|
| 8 |
+
Most AI evaluations involve games or static code benchmarks. This environment measures how accurately an agent can navigate a realistic business process, following internal company logic before issuing potentially destructive operations (e.g., refunds or enterprise escalations). It rewards adherence to protocol (partial rewards for checking policy) and penalizes hasty or contradictory actions.
|
| 9 |
+
|
| 10 |
+
## Tasks
|
| 11 |
+
* **Easy (`task_easy_1`)**: Straightforward accidental purchase refund. Agent simply checks policy, refunds, and closes.
|
| 12 |
+
* **Medium (`task_medium_1`)**: Refund request clearly violating policy. Agent must politely reject and close, not refund.
|
| 13 |
+
* **Hard (`task_hard_1`)**: Enterprise customer complains about multi-month double charges. Agent must verify user data, realize the urgency of tier 2 support, apologize, and properly escalate without closing abruptly.
|
| 14 |
+
|
| 15 |
+
## Action Space
|
| 16 |
+
`fetch_user_data(user_id)`
|
| 17 |
+
`check_policy(issue_type)`
|
| 18 |
+
`issue_refund(amount)`
|
| 19 |
+
`reply_to_customer(message)`
|
| 20 |
+
`escalate(reason)`
|
| 21 |
+
`close_ticket(resolution)`
|
| 22 |
+
|
| 23 |
+
## Observation Space
|
| 24 |
+
Provides details on the current `ticket`, `available_actions`, `history` of past actions, active `system_message`, and the latest `tool_output`.
|
| 25 |
+
|
| 26 |
+
## Setup and Run
|
| 27 |
+
|
| 28 |
+
Using Docker:
|
| 29 |
+
```bash
|
| 30 |
+
docker build -t openenv_support .
|
| 31 |
+
# Run API Server (HF Spaces mode):
|
| 32 |
+
docker run -p 7860:7860 openenv_support
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
Run baseline inference test script locally:
|
| 36 |
+
Ensure you install `pydantic` and `openai` first.
|
| 37 |
+
```bash
|
| 38 |
+
export OPENAI_API_KEY="your-key"
|
| 39 |
+
export MODEL_NAME="gpt-4o"
|
| 40 |
+
python inference.py
|
| 41 |
+
```
|
inference.py
ADDED
|
@@ -0,0 +1,127 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import json
|
| 3 |
+
import asyncio
|
| 4 |
+
from typing import List, Optional
|
| 5 |
+
from openai import OpenAI
|
| 6 |
+
from env.environment import SupportTicketEnv
|
| 7 |
+
from env.models import Action
|
| 8 |
+
|
| 9 |
+
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
|
| 10 |
+
MODEL_NAME = os.getenv("MODEL_NAME", "gpt-4o-mini")
|
| 11 |
+
HF_TOKEN = os.getenv("HF_TOKEN")
|
| 12 |
+
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
|
| 13 |
+
|
| 14 |
+
MAX_STEPS = 10
|
| 15 |
+
MAX_TOTAL_REWARD = 1.0
|
| 16 |
+
SUCCESS_SCORE_THRESHOLD = 0.8
|
| 17 |
+
|
| 18 |
+
def log_start(task: str, env: str, model: str) -> None:
    """Print the machine-parseable [START] marker for a new episode."""
    banner = "[START] task={} env={} model={}".format(task, env, model)
    # flush=True keeps marker lines ordered even under buffered pipes
    print(banner, flush=True)
|
| 20 |
+
|
| 21 |
+
def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str] = None) -> None:
    """Print one machine-parseable [STEP] trace line.

    The ``error=`` suffix is appended only when *error* is truthy.
    """
    suffix = f" error={error}" if error else ""
    print(
        f"[STEP] step={step} action={action!r} reward={reward} done={done}{suffix}",
        flush=True,  # keep trace lines ordered under buffered output
    )
|
| 24 |
+
|
| 25 |
+
def log_end(success: bool, steps: int, score: float, rewards: list) -> None:
    """Print the machine-parseable [END] summary line for an episode."""
    summary = f"[END] success={success} steps={steps} score={score} rewards={rewards}"
    print(summary, flush=True)
|
| 27 |
+
|
| 28 |
+
def parse_action(text: str) -> Action:
    """Extract the first ``{...}`` JSON object from *text* and build an Action.

    Falls back to a safe ``close_ticket`` action when no parseable JSON object
    is present, so a malformed model reply never crashes the episode loop.
    """
    try:
        start_idx = text.find('{')
        end_idx = text.rfind('}') + 1
        # Bug fix: str.rfind returns -1 when '}' is absent, making end_idx 0,
        # so the old check (end_idx != -1) could never detect a missing brace.
        # Requiring end_idx > start_idx also rejects reversed brace order.
        if start_idx != -1 and end_idx > start_idx:
            data = json.loads(text[start_idx:end_idx])
            return Action(
                action_type=data.get("action_type", "close_ticket"),
                parameters=data.get("parameters", {})
            )
    except Exception:
        # Malformed JSON from the model: fall through to the safe default.
        pass
    return Action(action_type="close_ticket", parameters={"resolution": "invalid"})
|
| 42 |
+
|
| 43 |
+
def get_model_message(client, step: int, env_state: str, history: List[str]) -> str:
    """Ask the LLM for its next action given the current observation and history.

    Returns the model's raw reply text, or "{}" when the request fails or the
    reply is empty — so the caller always receives parseable (if vacuous) text.
    """
    system_prompt = (
        "You are an AI support agent resolving customer tickets.\n"
        "Available Actions:\n"
        "- fetch_user_data(user_id)\n"
        "- check_policy(issue_type)\n"
        "- issue_refund(amount)\n"
        "- reply_to_customer(message)\n"
        "- escalate(reason)\n"
        "- close_ticket(resolution)\n\n"
        "Must respond with JSON format:\n"
        "{\"action_type\": \"...\", \"parameters\": {\"...\": \"...\"}}"
    )
    joined_history = "\n".join(history)
    user_prompt = f"History:\n{joined_history}\n\nCurrent Observation:\n{env_state}\n\nWhat is your next action JSON?"

    conversation = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=conversation,
            temperature=0.1,  # near-deterministic action selection
        )
        reply = (completion.choices[0].message.content or "").strip()
        return reply or "{}"
    except Exception as exc:
        # Network/API failure: log and hand back an empty JSON object so the
        # caller's parser falls through to its safe default action.
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "{}"
|
| 73 |
+
|
| 74 |
+
async def run_task(task_id: str, client: OpenAI) -> None:
    """Run one episode of *task_id* in SupportTicketEnv and log start/step/end.

    The final score is the reward reported on the terminal step, clamped to
    [0, 1]; success means it meets SUCCESS_SCORE_THRESHOLD.
    """
    env = SupportTicketEnv(task_id=task_id)
    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=task_id, env="SupportTicketEnv", model=MODEL_NAME)
    try:
        # Observation JSON shown to the model; refreshed after every step.
        latest_obs = env.reset().model_dump_json(indent=2)

        for step in range(1, MAX_STEPS + 1):
            if env.state.is_done:
                break

            reply = get_model_message(client, step, latest_obs, history)
            obs, reward, done, info = env.step(parse_action(reply))

            step_reward = info.get("current_reward", 0.0)
            rewards.append(step_reward)
            steps_taken = step
            latest_obs = obs.model_dump_json(indent=2)

            log_step(step=step, action=reply, reward=step_reward, done=done, error=None)
            history.append(f"Step {step}: {reply!r} -> reward {step_reward:+.2f}")

            if done:
                # The terminal step's reward is the episode score.
                score = step_reward
                break

        score = min(max(score, 0.0), 1.0)
        success = score >= SUCCESS_SCORE_THRESHOLD
    finally:
        # Always emit the end marker, even if the environment raised mid-episode.
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
|
| 117 |
+
|
| 118 |
+
async def main() -> None:
    """Run the baseline agent over every benchmark task, easiest first."""
    # "dummy-key" keeps the client constructible for offline/local backends.
    client = OpenAI(
        base_url=API_BASE_URL,
        api_key=os.getenv("OPENAI_API_KEY", "dummy-key"),
    )
    for task_id in ("task_easy_1", "task_medium_1", "task_hard_1"):
        await run_task(task_id, client)
|
| 125 |
+
|
| 126 |
+
# Script entry point: drive the async benchmark runner.
if __name__ == "__main__":
    asyncio.run(main())
|
openenv.yaml
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: "SupportTicketEnv"
|
| 2 |
+
version: "1.0.0"
|
| 3 |
+
description: >
|
| 4 |
+
A real-world OpenEnv environment simulating a customer support ticketing system.
|
| 5 |
+
The agent must process open tickets by optionally fetching user data, checking internal policies,
|
| 6 |
+
and taking terminal actions like issuing a refund, replying, escalating, or closing the ticket.
|
| 7 |
+
action_space:
|
| 8 |
+
type: "dict"
|
| 9 |
+
schema: |
|
| 10 |
+
{
|
| 11 |
+
"action_type": "[fetch_user_data, check_policy, issue_refund, reply_to_customer, escalate, close_ticket]",
|
| 12 |
+
"parameters": {"param_name": "param_value"}
|
| 13 |
+
}
|
| 14 |
+
observation_space:
|
| 15 |
+
type: "dict"
|
| 16 |
+
schema: |
|
| 17 |
+
{
|
| 18 |
+
"ticket": {"TicketInfo object"},
|
| 19 |
+
"available_actions": ["list of strings"],
|
| 20 |
+
"system_message": "string",
|
| 21 |
+
"history": ["List of strings of past actions"],
|
| 22 |
+
"tool_output": "Optional string of the latest action output",
|
| 23 |
+
"step_count": "integer"
|
| 24 |
+
}
|
| 25 |
+
reward_description: >
|
| 26 |
+
The reward is between 0.0 and 1.0. Partial credit is given for taking correct
|
| 27 |
+
intermediate steps (like checking policy before acting or fetching user data).
|
| 28 |
+
Penalties are applied for taking contradictory or destructive actions
|
| 29 |
+
(e.g., escalating unnecessarily, issuing refunds against policy).
|
pyproject.toml
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[build-system]
|
| 2 |
+
requires = ["hatchling"]
|
| 3 |
+
build-backend = "hatchling.build"
|
| 4 |
+
|
| 5 |
+
[project]
|
| 6 |
+
name = "support-ticket-env"
|
| 7 |
+
version = "1.0.0"
|
| 8 |
+
description = "A real-world OpenEnv environment simulating a customer support ticketing system."
|
| 9 |
+
readme = "README.md"
|
| 10 |
+
dependencies = [
|
| 11 |
+
"pydantic>=2.0",
|
| 12 |
+
"openenv-core>=0.2.0",
|
| 13 |
+
]
|
| 14 |
+
|
| 15 |
+
[project.scripts]
|
| 16 |
+
server = "server.app:main"
|
| 17 |
+
|
| 18 |
+
[project.optional-dependencies]
|
| 19 |
+
dev = [
|
| 20 |
+
"pytest",
|
| 21 |
+
]
|
requirements.txt
ADDED
|
@@ -0,0 +1,5 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
pydantic>=2.0
|
| 2 |
+
openai>=1.0.0
|
| 3 |
+
fastapi
|
| 4 |
+
uvicorn
|
| 5 |
+
openenv-core
|
server/app.py
ADDED
|
@@ -0,0 +1,53 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from fastapi import FastAPI, HTTPException
|
| 2 |
+
from pydantic import BaseModel
|
| 3 |
+
from env.environment import SupportTicketEnv
|
| 4 |
+
from env.models import Action
|
| 5 |
+
|
| 6 |
+
app = FastAPI(title="OpenEnv Support Ticket API")
|
| 7 |
+
|
| 8 |
+
CURRENT_ENV_SESSION = None
|
| 9 |
+
|
| 10 |
+
class InitRequest(BaseModel):
    """Request body for POST /reset: selects which task scenario to load."""
    # Defaults to the easiest scenario when the caller omits task_id.
    task_id: str = "task_easy_1"
|
| 12 |
+
|
| 13 |
+
@app.get("/")
def read_root():
    """Health-check endpoint confirming the service is reachable."""
    payload = {"status": "ok", "message": "Support Ticket OpenEnv is live."}
    return payload
|
| 16 |
+
|
| 17 |
+
@app.post("/reset")
def reset_env(req: InitRequest):
    """Create a fresh environment session for the requested task.

    Returns the first observation. Responds with HTTP 400 when the task_id is
    rejected by the environment constructor (signalled via ValueError).
    """
    global CURRENT_ENV_SESSION
    try:
        CURRENT_ENV_SESSION = SupportTicketEnv(task_id=req.task_id)
        obs = CURRENT_ENV_SESSION.reset()
        return {"observation": obs.model_dump()}
    except ValueError as e:
        # Chain the cause so server logs keep the original traceback.
        raise HTTPException(status_code=400, detail=str(e)) from e
|
| 26 |
+
|
| 27 |
+
@app.post("/step")
def step_env(action: Action):
    """Apply one agent action to the active environment session."""
    global CURRENT_ENV_SESSION
    if not CURRENT_ENV_SESSION:
        raise HTTPException(status_code=400, detail="Environment not initialized. Call /reset first.")

    observation, reward, done, info = CURRENT_ENV_SESSION.step(action)
    response = {"observation": observation.model_dump()}
    response["reward"] = reward
    response["done"] = done
    response["info"] = info
    return response
|
| 40 |
+
|
| 41 |
+
@app.get("/state")
def state_env():
    """Expose the session's full internal state (debug/inspection endpoint)."""
    global CURRENT_ENV_SESSION
    if not CURRENT_ENV_SESSION:
        raise HTTPException(status_code=400, detail="Environment not initialized.")
    current_state = CURRENT_ENV_SESSION.get_state()
    return current_state.model_dump()
|
| 47 |
+
|
| 48 |
+
def main():
    """Launch the API server on the HF Spaces default port (7860)."""
    # Local import keeps uvicorn optional for consumers that only import `app`.
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=7860)
|
| 51 |
+
|
| 52 |
+
# Allow running the server directly (python server/app.py) as well as via uvicorn.
if __name__ == "__main__":
    main()
|