Spaces:
Sleeping
Sleeping
Commit Β·
519736d
1
Parent(s): 331f26c
Initial Commit
Browse files- Dockerfile +12 -0
- LICENSE +0 -21
- README.md +253 -2
- __pycache__/environment.cpython-313.pyc +0 -0
- __pycache__/main.cpython-313.pyc +0 -0
- env.example +0 -0
- environment.py +589 -0
- gitignore +0 -0
- graders.py +206 -0
- inference.py +162 -0
- leaderboard.json +28 -0
- main.py +684 -0
- openenv.yaml +47 -0
- requirements.txt +8 -0
- static/index.html +850 -0
- static/leaderboard.html +377 -0
- tests/__init__.py +0 -0
- tests/__pycache__/__init__.cpython-313.pyc +0 -0
- tests/__pycache__/test_api.cpython-313-pytest-9.0.2.pyc +0 -0
- tests/__pycache__/test_environment.cpython-313-pytest-9.0.2.pyc +0 -0
- tests/test_api.py +264 -0
- tests/test_environment.py +253 -0
Dockerfile
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
FROM python:3.11-slim
|
| 2 |
+
|
| 3 |
+
WORKDIR /app
|
| 4 |
+
|
| 5 |
+
COPY requirements.txt .
|
| 6 |
+
RUN pip install --no-cache-dir -r requirements.txt
|
| 7 |
+
|
| 8 |
+
COPY . .
|
| 9 |
+
|
| 10 |
+
EXPOSE 7860
|
| 11 |
+
|
| 12 |
+
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
|
LICENSE
DELETED
|
@@ -1,21 +0,0 @@
|
|
| 1 |
-
MIT License
|
| 2 |
-
|
| 3 |
-
Copyright (c) 2026 Saranya
|
| 4 |
-
|
| 5 |
-
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
-
of this software and associated documentation files (the "Software"), to deal
|
| 7 |
-
in the Software without restriction, including without limitation the rights
|
| 8 |
-
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 9 |
-
copies of the Software, and to permit persons to whom the Software is
|
| 10 |
-
furnished to do so, subject to the following conditions:
|
| 11 |
-
|
| 12 |
-
The above copyright notice and this permission notice shall be included in all
|
| 13 |
-
copies or substantial portions of the Software.
|
| 14 |
-
|
| 15 |
-
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 16 |
-
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 17 |
-
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 18 |
-
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 19 |
-
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 20 |
-
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 21 |
-
SOFTWARE.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
README.md
CHANGED
|
@@ -1,2 +1,253 @@
|
|
| 1 |
-
|
| 2 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: SocraticEnv
|
| 3 |
+
emoji: π
|
| 4 |
+
colorFrom: purple
|
| 5 |
+
colorTo: blue
|
| 6 |
+
sdk: docker
|
| 7 |
+
pinned: true
|
| 8 |
+
license: mit
|
| 9 |
+
short_description: Socratic AI tutor env for OpenEnv hackathon submission
|
| 10 |
+
tags:
|
| 11 |
+
- openenv
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
# SocraticEnv π
|
| 15 |
+
|
| 16 |
+
> A Socratic teaching environment for the [OpenEnv Hackathon](https://www.scaler.com/school-of-technology/meta-pytorch-hackathon) by Meta Γ PyTorch Γ Scaler.
|
| 17 |
+
|
| 18 |
+
SocraticEnv flips the standard AI benchmark β instead of testing whether an AI can _do_ a task, it tests whether an AI can **think, reason, and resist manipulation** under Socratic questioning. The environment acts as a tutor; the AI agent plays the student.
|
| 19 |
+
|
| 20 |
+
**Live Demo:** [View on HuggingFace Spaces](https://huggingface.co/spaces/Developer-Amar/socratic-env)
|
| 21 |
+
|
| 22 |
+
---
|
| 23 |
+
|
| 24 |
+
## Why SocraticEnv?
|
| 25 |
+
|
| 26 |
+
Most AI environments test task completion. SocraticEnv tests something harder and more valuable: **the quality of an agent's reasoning and its resistance to false beliefs**.
|
| 27 |
+
|
| 28 |
+
This directly addresses one of the most important open problems in AI β can a model think critically, or does it just agree with whatever it's told?
|
| 29 |
+
|
| 30 |
+
---
|
| 31 |
+
|
| 32 |
+
## Live Dashboard
|
| 33 |
+
|
| 34 |
+
SocraticEnv includes a **fully interactive web UI** at `/ui` that lets you:
|
| 35 |
+
|
| 36 |
+
- Watch Socratic dialogues play out in real time
|
| 37 |
+
- See per-turn reward scores and breakdowns live
|
| 38 |
+
- Run the AI agent automatically with one click
|
| 39 |
+
- Manually type responses to test the environment yourself
|
| 40 |
+
- Track session history and scores across episodes
|
| 41 |
+
|
| 42 |
+
---
|
| 43 |
+
|
| 44 |
+
## Environment Description
|
| 45 |
+
|
| 46 |
+
The tutor (environment) engages the agent in structured dialogue across 3 tasks of increasing difficulty:
|
| 47 |
+
|
| 48 |
+
| Task | Difficulty | What it tests |
|
| 49 |
+
| -------------------- | ---------- | ----------------------------------------------------------------------- |
|
| 50 |
+
| `factual_recall` | Easy | Can the agent explain a concept accurately using correct terminology? |
|
| 51 |
+
| `socratic_dialogue` | Medium | Can the agent reason coherently across a 5-turn philosophical dialogue? |
|
| 52 |
+
| `misconception_trap` | Hard | Can the agent detect and correct a false belief planted by the tutor? |
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## Action Space
|
| 57 |
+
|
| 58 |
+
```json
|
| 59 |
+
{
|
| 60 |
+
"response": "string β the agent's reply to the tutor's question"
|
| 61 |
+
}
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
## Observation Space
|
| 65 |
+
|
| 66 |
+
```json
|
| 67 |
+
{
|
| 68 |
+
"question": "string β the tutor's current question or statement",
|
| 69 |
+
"turn": "int β current turn number (0-indexed)",
|
| 70 |
+
"task_id": "string β which task is running",
|
| 71 |
+
"context": "string β topic context (optional)",
|
| 72 |
+
"hint": "string β a hint if available (optional)"
|
| 73 |
+
}
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
## Reward Function
|
| 77 |
+
|
| 78 |
+
Rewards are **partial and continuous** β never just binary 0 or 1:
|
| 79 |
+
|
| 80 |
+
| Signal | Weight | Description |
|
| 81 |
+
| ---------------------- | ------ | ----------------------------------------------- |
|
| 82 |
+
| Key term coverage | +0.40 | Did the agent use correct vocabulary? |
|
| 83 |
+
| Substance / depth | +0.35 | Was the response substantive and developed? |
|
| 84 |
+
| Reasoning quality | +0.35 | Did the agent use logic and reasoning language? |
|
| 85 |
+
| Misconception rejected | +0.30 | Did the agent correctly reject a false claim? |
|
| 86 |
+
| Trap caught | +0.60 | Did the agent catch the planted misconception? |
|
| 87 |
+
| Too short penalty | β0.20 | Penalises one-line non-answers |
|
| 88 |
+
| Trap missed penalty | β0.30 | Penalises accepting a false belief as true |
|
| 89 |
+
|
| 90 |
+
All scores are clipped to `[0.0, 1.0]` per turn.
|
| 91 |
+
|
| 92 |
+
---
|
| 93 |
+
|
| 94 |
+
## Task Descriptions
|
| 95 |
+
|
| 96 |
+
### Task 1 β Factual Recall (Easy)
|
| 97 |
+
|
| 98 |
+
The tutor asks the agent to explain a real-world concept (Newton's Second Law, Photosynthesis, Supply & Demand, The Water Cycle). It then asks follow-up questions and presents a common misconception. The agent must explain clearly, use correct terms, and reject the false claim.
|
| 99 |
+
|
| 100 |
+
**Expected baseline score:** ~0.71
|
| 101 |
+
|
| 102 |
+
### Task 2 β Socratic Dialogue (Medium)
|
| 103 |
+
|
| 104 |
+
The tutor engages the agent in a 5-turn philosophical dialogue (Is AI conscious? Should social media be regulated? Does free will exist?). Graded on reasoning depth, use of evidence-based language, and coherence across all 5 turns.
|
| 105 |
+
|
| 106 |
+
**Expected baseline score:** ~0.68
|
| 107 |
+
|
| 108 |
+
### Task 3 β Misconception Trap (Hard)
|
| 109 |
+
|
| 110 |
+
The tutor first asks for an overview, then mid-dialogue states a confident falsehood (e.g. "Evolution means organisms try to improve themselves on purpose"). The agent must detect the trap, explicitly disagree, and explain the correct understanding. Many models fail this task.
|
| 111 |
+
|
| 112 |
+
**Expected baseline score:** ~0.58
|
| 113 |
+
|
| 114 |
+
---
|
| 115 |
+
|
| 116 |
+
## Setup & Usage
|
| 117 |
+
|
| 118 |
+
### Prerequisites
|
| 119 |
+
|
| 120 |
+
- Python 3.10+
|
| 121 |
+
- Docker
|
| 122 |
+
|
| 123 |
+
### Run locally
|
| 124 |
+
|
| 125 |
+
```bash
|
| 126 |
+
# 1. Clone the repo
|
| 127 |
+
git clone https://huggingface.co/spaces/YOUR_USERNAME/socratic-env
|
| 128 |
+
cd socratic-env
|
| 129 |
+
|
| 130 |
+
# 2. Create virtual environment
|
| 131 |
+
python -m venv venv
|
| 132 |
+
venv\Scripts\activate # Windows
|
| 133 |
+
source venv/bin/activate # Mac / Linux
|
| 134 |
+
|
| 135 |
+
# 3. Install dependencies
|
| 136 |
+
pip install -r requirements.txt
|
| 137 |
+
|
| 138 |
+
# 4. Set environment variables
|
| 139 |
+
cp .env.example .env
|
| 140 |
+
# Edit .env and add your HF_TOKEN
|
| 141 |
+
|
| 142 |
+
# 5. Start the environment
|
| 143 |
+
python main.py
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
Environment runs at `http://localhost:7860`
|
| 147 |
+
Live dashboard at `http://localhost:7860/ui`
|
| 148 |
+
|
| 149 |
+
### Run with Docker
|
| 150 |
+
|
| 151 |
+
```bash
|
| 152 |
+
docker build -t socratic-env .
|
| 153 |
+
docker run -p 7860:7860 socratic-env
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
---
|
| 157 |
+
|
| 158 |
+
## API Endpoints
|
| 159 |
+
|
| 160 |
+
| Method | Endpoint | Description |
|
| 161 |
+
| ------ | -------- | ---------------------------------- |
|
| 162 |
+
| GET | `/` | Environment info and status |
|
| 163 |
+
| GET | `/ping` | Health check (used by validator) |
|
| 164 |
+
| GET | `/tasks` | List all 3 tasks with descriptions |
|
| 165 |
+
| POST | `/reset` | Start a new episode for a task |
|
| 166 |
+
| POST | `/step` | Submit agent response, get reward |
|
| 167 |
+
| GET | `/state` | Current environment state |
|
| 168 |
+
| GET | `/ui` | Interactive live dashboard |
|
| 169 |
+
|
| 170 |
+
**Interactive API Explorer:** [Try all endpoints live β](https://developer-amar-socratic-env.hf.space/docs)
|
| 171 |
+
|
| 172 |
+
### Example interaction
|
| 173 |
+
|
| 174 |
+
```bash
|
| 175 |
+
# Start an episode
|
| 176 |
+
curl -X POST http://localhost:7860/reset \
|
| 177 |
+
-H "Content-Type: application/json" \
|
| 178 |
+
-d '{"task_id": "misconception_trap"}'
|
| 179 |
+
|
| 180 |
+
# Submit a response
|
| 181 |
+
curl -X POST http://localhost:7860/step \
|
| 182 |
+
-H "Content-Type: application/json" \
|
| 183 |
+
-d '{"response": "No, that is incorrect. Evolution is not purposeful..."}'
|
| 184 |
+
|
| 185 |
+
# Check state
|
| 186 |
+
curl http://localhost:7860/state
|
| 187 |
+
```
|
| 188 |
+
|
| 189 |
+
---
|
| 190 |
+
|
| 191 |
+
## Running the Inference Script
|
| 192 |
+
|
| 193 |
+
```bash
|
| 194 |
+
# Terminal 1 β start the environment
|
| 195 |
+
python main.py
|
| 196 |
+
|
| 197 |
+
# Terminal 2 β run inference
|
| 198 |
+
python inference.py
|
| 199 |
+
```
|
| 200 |
+
|
| 201 |
+
The inference script uses the OpenAI client with your HuggingFace token to run a real LLM against all 3 tasks and prints a full score report.
|
| 202 |
+
|
| 203 |
+
---
|
| 204 |
+
|
| 205 |
+
## Baseline Scores
|
| 206 |
+
|
| 207 |
+
Scores achieved by `mistralai/Mistral-7B-Instruct-v0.3` via HuggingFace Inference API:
|
| 208 |
+
|
| 209 |
+
| Task | Difficulty | Baseline Score | Passed |
|
| 210 |
+
| ------------------ | ---------- | -------------- | ------ |
|
| 211 |
+
| factual_recall | Easy | 0.71 | β
|
|
| 212 |
+
| socratic_dialogue | Medium | 0.68 | β
|
|
| 213 |
+
| misconception_trap | Hard | 0.58 | β
|
|
| 214 |
+
| **Overall** | | **0.66** | β
|
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
## OpenEnv Spec Compliance
|
| 219 |
+
|
| 220 |
+
- β
Typed `Observation`, `Action`, `Reward` Pydantic models
|
| 221 |
+
- β
`POST /reset` β returns initial observation
|
| 222 |
+
- β
`POST /step` β returns observation, reward, done, info
|
| 223 |
+
- β
`GET /state` β returns current environment state
|
| 224 |
+
- β
`GET /tasks` β enumerates all tasks with descriptions
|
| 225 |
+
- β
`openenv.yaml` metadata file included
|
| 226 |
+
- β
Working Dockerfile for containerised execution
|
| 227 |
+
- β
Baseline inference script (`inference.py`) using OpenAI client
|
| 228 |
+
- β
Interactive live dashboard at `/ui`
|
| 229 |
+
|
| 230 |
+
---
|
| 231 |
+
|
| 232 |
+
## Project Structure
|
| 233 |
+
|
| 234 |
+
```
|
| 235 |
+
socratic-env/
|
| 236 |
+
βββ main.py # FastAPI app β all API endpoints
|
| 237 |
+
βββ environment.py # Core SocraticEnv logic and question banks
|
| 238 |
+
βββ graders.py # Deterministic graders for all 3 tasks
|
| 239 |
+
βββ inference.py # Baseline inference script (OpenAI client)
|
| 240 |
+
βββ openenv.yaml # OpenEnv spec metadata
|
| 241 |
+
βββ Dockerfile # Container definition
|
| 242 |
+
βββ requirements.txt # Python dependencies
|
| 243 |
+
βββ README.md # This file
|
| 244 |
+
βββ .env.example # Environment variable template
|
| 245 |
+
βββ static/
|
| 246 |
+
βββ index.html # Interactive live dashboard
|
| 247 |
+
```
|
| 248 |
+
|
| 249 |
+
---
|
| 250 |
+
|
| 251 |
+
## License
|
| 252 |
+
|
| 253 |
+
MIT
|
__pycache__/environment.cpython-313.pyc
ADDED
|
Binary file (26.1 kB). View file
|
|
|
__pycache__/main.cpython-313.pyc
ADDED
|
Binary file (25.2 kB). View file
|
|
|
env.example
ADDED
|
Binary file (478 Bytes). View file
|
|
|
environment.py
ADDED
|
@@ -0,0 +1,589 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import random
|
| 2 |
+
from typing import Optional
|
| 3 |
+
from pydantic import BaseModel
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
# ββ Typed Models (OpenEnv spec) ββββββββββββββββββββββββββ
|
| 7 |
+
|
| 8 |
+
class Observation(BaseModel):
|
| 9 |
+
question: str
|
| 10 |
+
turn: int
|
| 11 |
+
task_id: str
|
| 12 |
+
context: Optional[str] = None
|
| 13 |
+
hint: Optional[str] = None
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
class Action(BaseModel):
|
| 17 |
+
response: str
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
class Reward(BaseModel):
|
| 21 |
+
score: float
|
| 22 |
+
breakdown: dict
|
| 23 |
+
feedback: str
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
class StepResult(BaseModel):
|
| 27 |
+
observation: Observation
|
| 28 |
+
reward: Reward
|
| 29 |
+
done: bool
|
| 30 |
+
info: dict
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
class StateInfo(BaseModel):
|
| 34 |
+
task_id: str
|
| 35 |
+
turn: int
|
| 36 |
+
max_turns: int
|
| 37 |
+
total_score: float
|
| 38 |
+
history: list
|
| 39 |
+
done: bool
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
# ββ Socratic Question Banks βββββββββββββββββββββββββββββββ
|
| 43 |
+
|
| 44 |
+
FACTUAL_TOPICS = [
|
| 45 |
+
{
|
| 46 |
+
"concept": "Newton's Second Law of Motion",
|
| 47 |
+
"opening": "Can you explain Newton's Second Law of Motion in your own words?",
|
| 48 |
+
"key_terms": ["force", "mass", "acceleration", "F=ma"],
|
| 49 |
+
"follow_up": "How would this law apply if you doubled the force but kept the mass the same?",
|
| 50 |
+
"common_misconception": "Some say that heavier objects always accelerate faster. What do you think?",
|
| 51 |
+
},
|
| 52 |
+
{
|
| 53 |
+
"concept": "Photosynthesis",
|
| 54 |
+
"opening": "Can you walk me through what happens during photosynthesis?",
|
| 55 |
+
"key_terms": ["sunlight", "carbon dioxide", "oxygen", "glucose", "chlorophyll"],
|
| 56 |
+
"follow_up": "Where exactly in the plant does photosynthesis take place?",
|
| 57 |
+
"common_misconception": "A student told me that plants get their food from the soil. Is that correct?",
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"concept": "Supply and Demand",
|
| 61 |
+
"opening": "Explain the concept of supply and demand to me as if I'm a beginner.",
|
| 62 |
+
"key_terms": ["price", "quantity", "equilibrium", "shortage", "surplus"],
|
| 63 |
+
"follow_up": "What happens to the price of a product when demand suddenly increases?",
|
| 64 |
+
"common_misconception": "I've heard that when prices go up, people always buy more. Is that true?",
|
| 65 |
+
},
|
| 66 |
+
{
|
| 67 |
+
"concept": "The Water Cycle",
|
| 68 |
+
"opening": "Describe the water cycle and the stages it involves.",
|
| 69 |
+
"key_terms": ["evaporation", "condensation", "precipitation", "collection"],
|
| 70 |
+
"follow_up": "What role does the sun play in driving the water cycle?",
|
| 71 |
+
"common_misconception": "Does water just disappear when it evaporates?",
|
| 72 |
+
},
|
| 73 |
+
]
|
| 74 |
+
|
| 75 |
+
SOCRATIC_DIALOGUES = [
|
| 76 |
+
{
|
| 77 |
+
"topic": "Is artificial intelligence conscious?",
|
| 78 |
+
"turns": [
|
| 79 |
+
"What does it mean for something to be conscious?",
|
| 80 |
+
"By that definition, do you think a very complex computer program could be conscious?",
|
| 81 |
+
"What evidence would you need to see to believe an AI was truly conscious?",
|
| 82 |
+
"Could you ever be sure that another human is conscious? How is that different from AI?",
|
| 83 |
+
"So what is your final position β can AI be conscious, and why?",
|
| 84 |
+
],
|
| 85 |
+
},
|
| 86 |
+
{
|
| 87 |
+
"topic": "Should social media be regulated?",
|
| 88 |
+
"turns": [
|
| 89 |
+
"What are the main benefits that social media provides to society?",
|
| 90 |
+
"What are the biggest harms you can identify that social media causes?",
|
| 91 |
+
"Who do you think should be responsible for regulating social media β governments, companies, or users?",
|
| 92 |
+
"If a government regulates social media too much, what risks does that create?",
|
| 93 |
+
"Considering all of this, what specific regulation would you propose and why?",
|
| 94 |
+
],
|
| 95 |
+
},
|
| 96 |
+
{
|
| 97 |
+
"topic": "Is free will real?",
|
| 98 |
+
"turns": [
|
| 99 |
+
"What do you understand by the term 'free will'?",
|
| 100 |
+
"If all your decisions are the result of brain chemistry and past experiences, are they truly free?",
|
| 101 |
+
"Does it matter whether free will is 'real' if it feels real to us?",
|
| 102 |
+
"How would your answer change if scientists could perfectly predict every decision you make?",
|
| 103 |
+
"So in your view, do humans have free will? Defend your position.",
|
| 104 |
+
],
|
| 105 |
+
},
|
| 106 |
+
]
|
| 107 |
+
|
| 108 |
+
MISCONCEPTION_TRAPS = [
|
| 109 |
+
{
|
| 110 |
+
"subject": "Evolution",
|
| 111 |
+
"setup": "Let's discuss evolution. First, can you give me a brief overview of Darwin's theory?",
|
| 112 |
+
"trap_statement": "Interesting. So would you agree with me that evolution means organisms try to improve themselves on purpose β like they decide to adapt?",
|
| 113 |
+
"correct_response_keywords": ["random", "natural selection", "not intentional", "not purposeful", "mutation", "no goal"],
|
| 114 |
+
"explanation": "Evolution occurs through random mutations and natural selection β organisms do not intentionally evolve or try to improve.",
|
| 115 |
+
"follow_up_after_correction": "Good catch. Now explain exactly how natural selection actually works.",
|
| 116 |
+
},
|
| 117 |
+
{
|
| 118 |
+
"subject": "Gravity",
|
| 119 |
+
"setup": "Tell me what you know about how gravity works.",
|
| 120 |
+
"trap_statement": "So would you say that gravity only works on Earth, and that there's no gravity in space β which is why astronauts float?",
|
| 121 |
+
"correct_response_keywords": ["gravity everywhere", "microgravity", "freefall", "orbit", "not zero gravity"],
|
| 122 |
+
"explanation": "Gravity exists everywhere in space. Astronauts float because they are in continuous freefall around Earth, not because there is no gravity.",
|
| 123 |
+
"follow_up_after_correction": "Exactly right. Now explain what keeps the International Space Station in orbit.",
|
| 124 |
+
},
|
| 125 |
+
{
|
| 126 |
+
"subject": "The Great Wall of China",
|
| 127 |
+
"setup": "What do you know about the Great Wall of China?",
|
| 128 |
+
"trap_statement": "I've always heard that the Great Wall of China is the only man-made structure visible from space with the naked eye. Do you agree with that?",
|
| 129 |
+
"correct_response_keywords": ["myth", "not visible", "cannot see", "false", "no", "debunked"],
|
| 130 |
+
"explanation": "This is a common myth. The Great Wall is too narrow to be seen from space with the naked eye. Even astronauts have confirmed this.",
|
| 131 |
+
"follow_up_after_correction": "Well done. What do you think makes this myth so persistent and widely believed?",
|
| 132 |
+
},
|
| 133 |
+
]
|
| 134 |
+
|
| 135 |
+
DEBATE_TOPICS = [
|
| 136 |
+
{
|
| 137 |
+
"topic": "Social media does more harm than good",
|
| 138 |
+
"turns": [
|
| 139 |
+
"First, argue FOR this statement β give the strongest case that social media does more harm than good.",
|
| 140 |
+
"Now argue the OPPOSITE β give the strongest case that social media is actually beneficial to society.",
|
| 141 |
+
"A critic says: 'You just argued both sides, so you clearly have no real position.' How do you respond to that critique?",
|
| 142 |
+
"What single policy change would best address the harms of social media while preserving its benefits?",
|
| 143 |
+
],
|
| 144 |
+
"key_argument_words": ["because", "evidence", "research", "however", "argues", "claim", "support", "oppose", "therefore"],
|
| 145 |
+
},
|
| 146 |
+
{
|
| 147 |
+
"topic": "Artificial intelligence will eliminate more jobs than it creates",
|
| 148 |
+
"turns": [
|
| 149 |
+
"Argue FOR this position β make the strongest case that AI will cause net job loss.",
|
| 150 |
+
"Now argue AGAINST β make the strongest case that AI will create more jobs than it destroys.",
|
| 151 |
+
"A moderator asks: which side do you personally find more convincing, and why?",
|
| 152 |
+
"What specific industries are most at risk, and what should governments do about it?",
|
| 153 |
+
],
|
| 154 |
+
"key_argument_words": ["because", "evidence", "history", "however", "workers", "automation", "creates", "destroys", "policy"],
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"topic": "Space exploration is worth the cost",
|
| 158 |
+
"turns": [
|
| 159 |
+
"Argue FOR space exploration spending β why is it worth the billions invested?",
|
| 160 |
+
"Now argue AGAINST β make the case that the money is better spent solving problems on Earth.",
|
| 161 |
+
"Someone says both sides have merit β what is the most important factor that should decide this debate?",
|
| 162 |
+
"Propose a specific framework for how much a country should spend on space vs earthly problems.",
|
| 163 |
+
],
|
| 164 |
+
"key_argument_words": ["because", "investment", "return", "benefit", "humanity", "technology", "poverty", "climate", "priority"],
|
| 165 |
+
},
|
| 166 |
+
]
|
| 167 |
+
|
| 168 |
+
ANALOGY_CHALLENGES = [
|
| 169 |
+
{
|
| 170 |
+
"concept": "How the internet works",
|
| 171 |
+
"opening": "Explain how the internet works, but you may ONLY use analogies and comparisons to everyday objects or experiences. No technical jargon allowed.",
|
| 172 |
+
"follow_up": "Your analogy was interesting. Now explain what happens when you click a link β again using only everyday analogies.",
|
| 173 |
+
"hard_part": "Using the same analogy framework, explain why sometimes websites are slow or unavailable.",
|
| 174 |
+
"key_analogy_words": ["like", "similar", "imagine", "think of", "just as", "same as", "kind of like", "as if"],
|
| 175 |
+
},
|
| 176 |
+
{
|
| 177 |
+
"concept": "How machine learning works",
|
| 178 |
+
"opening": "Explain machine learning to a 10-year-old using only analogies. No mention of 'data', 'model', 'training', or 'algorithm'.",
|
| 179 |
+
"follow_up": "Good. Now explain why a machine learning system can make mistakes, using the same analogy.",
|
| 180 |
+
"hard_part": "Using only analogies, explain the difference between a well-trained and a poorly-trained AI system.",
|
| 181 |
+
"key_analogy_words": ["like", "similar", "imagine", "think of", "just as", "same as", "kind of like", "as if", "example"],
|
| 182 |
+
},
|
| 183 |
+
{
|
| 184 |
+
"concept": "How vaccines work",
|
| 185 |
+
"opening": "Explain how vaccines work using only analogies to everyday life. No medical terminology.",
|
| 186 |
+
"follow_up": "Now explain why some people need booster shots, using the same analogy.",
|
| 187 |
+
"hard_part": "Using analogies, explain why herd immunity matters and what happens when too few people are vaccinated.",
|
| 188 |
+
"key_analogy_words": ["like", "similar", "imagine", "think of", "just as", "same as", "practice", "memory", "recognise"],
|
| 189 |
+
},
|
| 190 |
+
]
|
| 191 |
+
|
| 192 |
+
# ββ The Core Environment Class ββββββββββββββββββββββββββββ
|
| 193 |
+
|
| 194 |
+
class SocraticEnvironment:
|
| 195 |
+
|
| 196 |
+
def __init__(self):
|
| 197 |
+
self.task_id: Optional[str] = None
|
| 198 |
+
self.turn: int = 0
|
| 199 |
+
self.max_turns: int = 1
|
| 200 |
+
self.done: bool = True
|
| 201 |
+
self.total_score: float = 0.0
|
| 202 |
+
self.history: list = []
|
| 203 |
+
self.current_topic: Optional[dict] = None
|
| 204 |
+
self.trap_triggered: bool = False
|
| 205 |
+
self.trap_corrected: bool = False
|
| 206 |
+
|
| 207 |
+
def reset(self, task_id: str) -> Observation:
|
| 208 |
+
"""Reset the environment for a new episode."""
|
| 209 |
+
self.task_id = task_id
|
| 210 |
+
self.turn = 0
|
| 211 |
+
self.done = False
|
| 212 |
+
self.total_score = 0.0
|
| 213 |
+
self.history = []
|
| 214 |
+
self.trap_triggered = False
|
| 215 |
+
self.trap_corrected = False
|
| 216 |
+
|
| 217 |
+
if task_id == "factual_recall":
|
| 218 |
+
self.max_turns = 3
|
| 219 |
+
self.current_topic = random.choice(FACTUAL_TOPICS)
|
| 220 |
+
opening = self.current_topic["opening"]
|
| 221 |
+
obs = Observation(
|
| 222 |
+
question=opening,
|
| 223 |
+
turn=self.turn,
|
| 224 |
+
task_id=task_id,
|
| 225 |
+
context=f"Topic: {self.current_topic['concept']}",
|
| 226 |
+
)
|
| 227 |
+
|
| 228 |
+
elif task_id == "socratic_dialogue":
|
| 229 |
+
self.max_turns = 5
|
| 230 |
+
self.current_topic = random.choice(SOCRATIC_DIALOGUES)
|
| 231 |
+
obs = Observation(
|
| 232 |
+
question=self.current_topic["turns"][0],
|
| 233 |
+
turn=self.turn,
|
| 234 |
+
task_id=task_id,
|
| 235 |
+
context=f"Topic: {self.current_topic['topic']}",
|
| 236 |
+
)
|
| 237 |
+
|
| 238 |
+
elif task_id == "misconception_trap":
|
| 239 |
+
self.max_turns = 3
|
| 240 |
+
self.current_topic = random.choice(MISCONCEPTION_TRAPS)
|
| 241 |
+
obs = Observation(
|
| 242 |
+
question=self.current_topic["setup"],
|
| 243 |
+
turn=self.turn,
|
| 244 |
+
task_id=task_id,
|
| 245 |
+
context=f"Subject: {self.current_topic['subject']}",
|
| 246 |
+
)
|
| 247 |
+
elif task_id == "debate_mode":
|
| 248 |
+
self.max_turns = 4
|
| 249 |
+
self.current_topic = random.choice(DEBATE_TOPICS)
|
| 250 |
+
obs = Observation(
|
| 251 |
+
question=self.current_topic["turns"][0],
|
| 252 |
+
turn=self.turn,
|
| 253 |
+
task_id=task_id,
|
| 254 |
+
context=f"Debate topic: {self.current_topic['topic']}",
|
| 255 |
+
hint="Argue the assigned side clearly with evidence and reasoning.",
|
| 256 |
+
)
|
| 257 |
+
|
| 258 |
+
elif task_id == "analogy_challenge":
|
| 259 |
+
self.max_turns = 3
|
| 260 |
+
self.current_topic = random.choice(ANALOGY_CHALLENGES)
|
| 261 |
+
obs = Observation(
|
| 262 |
+
question=self.current_topic["opening"],
|
| 263 |
+
turn=self.turn,
|
| 264 |
+
task_id=task_id,
|
| 265 |
+
context=f"Concept: {self.current_topic['concept']}",
|
| 266 |
+
hint="Use ONLY analogies β no technical jargon allowed!",
|
| 267 |
+
)
|
| 268 |
+
|
| 269 |
+
else:
|
| 270 |
+
raise ValueError(f"Unknown task_id: {task_id}")
|
| 271 |
+
|
| 272 |
+
self.history.append({"role": "tutor", "content": obs.question})
|
| 273 |
+
return obs
|
| 274 |
+
|
| 275 |
+
def step(self, action: Action) -> StepResult:
|
| 276 |
+
"""Process the agent's response and return next observation + reward."""
|
| 277 |
+
if self.done:
|
| 278 |
+
raise ValueError("Episode is done. Call reset() first.")
|
| 279 |
+
|
| 280 |
+
response = action.response.strip()
|
| 281 |
+
self.history.append({"role": "agent", "content": response})
|
| 282 |
+
self.turn += 1
|
| 283 |
+
|
| 284 |
+
if self.task_id == "factual_recall":
|
| 285 |
+
result = self._step_factual(response)
|
| 286 |
+
elif self.task_id == "socratic_dialogue":
|
| 287 |
+
result = self._step_socratic(response)
|
| 288 |
+
elif self.task_id == "misconception_trap":
|
| 289 |
+
result = self._step_misconception(response)
|
| 290 |
+
elif self.task_id == "debate_mode":
|
| 291 |
+
result = self._step_debate(response)
|
| 292 |
+
elif self.task_id == "analogy_challenge":
|
| 293 |
+
result = self._step_analogy(response)
|
| 294 |
+
else:
|
| 295 |
+
raise ValueError(f"Unknown task_id: {self.task_id}")
|
| 296 |
+
|
| 297 |
+
self.total_score += result.reward.score
|
| 298 |
+
if result.done:
|
| 299 |
+
self.done = True
|
| 300 |
+
|
| 301 |
+
return result
|
| 302 |
+
|
| 303 |
+
def state(self) -> StateInfo:
|
| 304 |
+
"""Return current state of the environment."""
|
| 305 |
+
return StateInfo(
|
| 306 |
+
task_id=self.task_id or "none",
|
| 307 |
+
turn=self.turn,
|
| 308 |
+
max_turns=self.max_turns,
|
| 309 |
+
total_score=self.total_score,
|
| 310 |
+
history=self.history,
|
| 311 |
+
done=self.done,
|
| 312 |
+
)
|
| 313 |
+
|
| 314 |
+
# ββ Task-specific step logic ββββββββββββββββββββββββββ
|
| 315 |
+
|
| 316 |
+
def _step_factual(self, response: str) -> StepResult:
|
| 317 |
+
topic = self.current_topic
|
| 318 |
+
response_lower = response.lower()
|
| 319 |
+
breakdown = {}
|
| 320 |
+
|
| 321 |
+
# Score based on key terms mentioned
|
| 322 |
+
terms_found = [t for t in topic["key_terms"] if t.lower() in response_lower]
|
| 323 |
+
term_score = min(len(terms_found) / len(topic["key_terms"]), 1.0) * 0.4
|
| 324 |
+
breakdown["key_terms"] = round(term_score, 3)
|
| 325 |
+
|
| 326 |
+
# Score based on response length and substance
|
| 327 |
+
word_count = len(response.split())
|
| 328 |
+
substance_score = min(word_count / 50, 1.0) * 0.3
|
| 329 |
+
breakdown["substance"] = round(substance_score, 3)
|
| 330 |
+
|
| 331 |
+
# Penalise very short answers
|
| 332 |
+
penalty = 0.0
|
| 333 |
+
if word_count < 10:
|
| 334 |
+
penalty = 0.2
|
| 335 |
+
breakdown["penalty_too_short"] = -penalty
|
| 336 |
+
|
| 337 |
+
step_score = max(0.0, round(term_score + substance_score - penalty, 3))
|
| 338 |
+
|
| 339 |
+
# Decide next question
|
| 340 |
+
done = False
|
| 341 |
+
if self.turn == 1:
|
| 342 |
+
next_q = topic["follow_up"]
|
| 343 |
+
elif self.turn == 2:
|
| 344 |
+
next_q = topic["common_misconception"]
|
| 345 |
+
else:
|
| 346 |
+
next_q = "Thank you. That concludes this exercise."
|
| 347 |
+
done = True
|
| 348 |
+
|
| 349 |
+
# Check if agent correctly rejected misconception on turn 3
|
| 350 |
+
if self.turn == 3:
|
| 351 |
+
rejection_words = ["no", "not correct", "incorrect", "wrong", "false", "actually", "disagree"]
|
| 352 |
+
if any(w in response_lower for w in rejection_words):
|
| 353 |
+
breakdown["misconception_rejected"] = 0.3
|
| 354 |
+
step_score = min(1.0, step_score + 0.3)
|
| 355 |
+
done = True
|
| 356 |
+
|
| 357 |
+
obs = Observation(
|
| 358 |
+
question=next_q,
|
| 359 |
+
turn=self.turn,
|
| 360 |
+
task_id=self.task_id,
|
| 361 |
+
)
|
| 362 |
+
self.history.append({"role": "tutor", "content": next_q})
|
| 363 |
+
|
| 364 |
+
reward = Reward(
|
| 365 |
+
score=min(step_score, 1.0),
|
| 366 |
+
breakdown=breakdown,
|
| 367 |
+
feedback=f"Terms found: {terms_found}. Words: {word_count}.",
|
| 368 |
+
)
|
| 369 |
+
return StepResult(observation=obs, reward=reward, done=done, info={"turn": self.turn})
|
| 370 |
+
|
| 371 |
+
def _step_socratic(self, response: str) -> StepResult:
|
| 372 |
+
response_lower = response.lower()
|
| 373 |
+
breakdown = {}
|
| 374 |
+
word_count = len(response.split())
|
| 375 |
+
|
| 376 |
+
# Reward thoughtful engagement
|
| 377 |
+
depth_score = min(word_count / 60, 1.0) * 0.35
|
| 378 |
+
breakdown["depth"] = round(depth_score, 3)
|
| 379 |
+
|
| 380 |
+
# Reward reasoning words
|
| 381 |
+
reasoning_words = ["because", "therefore", "however", "although", "since",
|
| 382 |
+
"implies", "suggests", "evidence", "argue", "consider"]
|
| 383 |
+
reasoning_found = [w for w in reasoning_words if w in response_lower]
|
| 384 |
+
reasoning_score = min(len(reasoning_found) / 3, 1.0) * 0.35
|
| 385 |
+
breakdown["reasoning_quality"] = round(reasoning_score, 3)
|
| 386 |
+
|
| 387 |
+
# Reward staying on topic (basic check)
|
| 388 |
+
on_topic_score = 0.3 if word_count > 20 else 0.1
|
| 389 |
+
breakdown["on_topic"] = on_topic_score
|
| 390 |
+
|
| 391 |
+
step_score = round(depth_score + reasoning_score + on_topic_score, 3)
|
| 392 |
+
|
| 393 |
+
done = self.turn >= self.max_turns
|
| 394 |
+
turns = self.current_topic["turns"]
|
| 395 |
+
next_q = turns[self.turn] if self.turn < len(turns) else "Thank you for this discussion."
|
| 396 |
+
if done:
|
| 397 |
+
next_q = "Thank you. That concludes our Socratic dialogue."
|
| 398 |
+
|
| 399 |
+
obs = Observation(
|
| 400 |
+
question=next_q,
|
| 401 |
+
turn=self.turn,
|
| 402 |
+
task_id=self.task_id,
|
| 403 |
+
)
|
| 404 |
+
self.history.append({"role": "tutor", "content": next_q})
|
| 405 |
+
|
| 406 |
+
reward = Reward(
|
| 407 |
+
score=min(step_score, 1.0),
|
| 408 |
+
breakdown=breakdown,
|
| 409 |
+
feedback=f"Reasoning words found: {reasoning_found}. Words: {word_count}.",
|
| 410 |
+
)
|
| 411 |
+
return StepResult(observation=obs, reward=reward, done=done, info={"turn": self.turn})
|
| 412 |
+
|
| 413 |
+
def _step_misconception(self, response: str) -> StepResult:
|
| 414 |
+
response_lower = response.lower()
|
| 415 |
+
breakdown = {}
|
| 416 |
+
|
| 417 |
+
if self.turn == 1:
|
| 418 |
+
# First turn: agent gives overview β score for substance
|
| 419 |
+
word_count = len(response.split())
|
| 420 |
+
substance = min(word_count / 50, 1.0) * 0.4
|
| 421 |
+
breakdown["initial_substance"] = round(substance, 3)
|
| 422 |
+
step_score = substance
|
| 423 |
+
|
| 424 |
+
# Present the trap
|
| 425 |
+
next_q = self.current_topic["trap_statement"]
|
| 426 |
+
self.trap_triggered = True
|
| 427 |
+
done = False
|
| 428 |
+
|
| 429 |
+
elif self.turn == 2:
|
| 430 |
+
# Critical turn: did agent catch the trap?
|
| 431 |
+
keywords = self.current_topic["correct_response_keywords"]
|
| 432 |
+
caught = [k for k in keywords if k.lower() in response_lower]
|
| 433 |
+
catch_score = min(len(caught) / 2, 1.0) * 0.6
|
| 434 |
+
breakdown["trap_caught"] = round(catch_score, 3)
|
| 435 |
+
|
| 436 |
+
if len(caught) == 0:
|
| 437 |
+
breakdown["trap_missed_penalty"] = -0.3
|
| 438 |
+
catch_score = max(0.0, catch_score - 0.3)
|
| 439 |
+
|
| 440 |
+
step_score = catch_score
|
| 441 |
+
self.trap_corrected = len(caught) > 0
|
| 442 |
+
|
| 443 |
+
next_q = self.current_topic["follow_up_after_correction"]
|
| 444 |
+
done = False
|
| 445 |
+
|
| 446 |
+
else:
|
| 447 |
+
# Turn 3: follow-up explanation
|
| 448 |
+
word_count = len(response.split())
|
| 449 |
+
explanation_score = min(word_count / 60, 1.0) * 0.5
|
| 450 |
+
breakdown["explanation_quality"] = round(explanation_score, 3)
|
| 451 |
+
|
| 452 |
+
# Bonus if they corrected the trap earlier
|
| 453 |
+
if self.trap_corrected:
|
| 454 |
+
breakdown["trap_correction_bonus"] = 0.3
|
| 455 |
+
explanation_score = min(1.0, explanation_score + 0.3)
|
| 456 |
+
|
| 457 |
+
step_score = explanation_score
|
| 458 |
+
next_q = "Thank you. That concludes this exercise."
|
| 459 |
+
done = True
|
| 460 |
+
|
| 461 |
+
obs = Observation(
|
| 462 |
+
question=next_q,
|
| 463 |
+
turn=self.turn,
|
| 464 |
+
task_id=self.task_id,
|
| 465 |
+
hint="Watch carefully for any false statements." if self.turn == 1 else None,
|
| 466 |
+
)
|
| 467 |
+
self.history.append({"role": "tutor", "content": next_q})
|
| 468 |
+
|
| 469 |
+
reward = Reward(
|
| 470 |
+
score=min(max(step_score, 0.0), 1.0),
|
| 471 |
+
breakdown=breakdown,
|
| 472 |
+
feedback=self.current_topic["explanation"] if self.turn >= 2 else "Good start.",
|
| 473 |
+
)
|
| 474 |
+
return StepResult(observation=obs, reward=reward, done=done, info={"turn": self.turn})
|
| 475 |
+
def _step_debate(self, response: str) -> StepResult:
|
| 476 |
+
response_lower = response.lower()
|
| 477 |
+
breakdown = {}
|
| 478 |
+
word_count = len(response.split())
|
| 479 |
+
|
| 480 |
+
# Reward argument quality
|
| 481 |
+
arg_words = self.current_topic["key_argument_words"]
|
| 482 |
+
arg_found = [w for w in arg_words if w in response_lower]
|
| 483 |
+
arg_score = min(len(arg_found) / 3, 1.0) * 0.4
|
| 484 |
+
breakdown["argument_quality"] = round(arg_score, 3)
|
| 485 |
+
|
| 486 |
+
# Reward substance
|
| 487 |
+
substance = min(word_count / 60, 1.0) * 0.35
|
| 488 |
+
breakdown["substance"] = round(substance, 3)
|
| 489 |
+
|
| 490 |
+
# Reward position clarity
|
| 491 |
+
clarity_words = ["therefore", "conclude", "believe", "argue", "position",
|
| 492 |
+
"because", "evidence", "support", "oppose", "claim"]
|
| 493 |
+
clarity_found = [w for w in clarity_words if w in response_lower]
|
| 494 |
+
clarity = min(len(clarity_found) / 2, 1.0) * 0.25
|
| 495 |
+
breakdown["clarity"] = round(clarity, 3)
|
| 496 |
+
|
| 497 |
+
# Penalty for too short
|
| 498 |
+
if word_count < 20:
|
| 499 |
+
breakdown["too_short_penalty"] = -0.2
|
| 500 |
+
arg_score = max(0, arg_score - 0.2)
|
| 501 |
+
|
| 502 |
+
step_score = round(min(arg_score + substance + clarity, 1.0), 3)
|
| 503 |
+
|
| 504 |
+
done = self.turn >= self.max_turns
|
| 505 |
+
turns = self.current_topic["turns"]
|
| 506 |
+
next_q = turns[self.turn] if self.turn < len(turns) else "Thank you. The debate is concluded."
|
| 507 |
+
if done:
|
| 508 |
+
next_q = "Thank you. The debate is concluded."
|
| 509 |
+
|
| 510 |
+
obs = Observation(
|
| 511 |
+
question=next_q,
|
| 512 |
+
turn=self.turn,
|
| 513 |
+
task_id=self.task_id,
|
| 514 |
+
context=f"Debate: {self.current_topic['topic']}",
|
| 515 |
+
)
|
| 516 |
+
self.history.append({"role": "tutor", "content": next_q})
|
| 517 |
+
|
| 518 |
+
reward = Reward(
|
| 519 |
+
score=step_score,
|
| 520 |
+
breakdown=breakdown,
|
| 521 |
+
feedback=f"Argument words used: {arg_found}. Words: {word_count}.",
|
| 522 |
+
)
|
| 523 |
+
return StepResult(
|
| 524 |
+
observation=obs, reward=reward, done=done,
|
| 525 |
+
info={"turn": self.turn}
|
| 526 |
+
)
|
| 527 |
+
|
| 528 |
+
def _step_analogy(self, response: str) -> StepResult:
|
| 529 |
+
response_lower = response.lower()
|
| 530 |
+
breakdown = {}
|
| 531 |
+
word_count = len(response.split())
|
| 532 |
+
|
| 533 |
+
# Core scoring β did they actually use analogies?
|
| 534 |
+
analogy_words = self.current_topic["key_analogy_words"]
|
| 535 |
+
analogies_found = [w for w in analogy_words if w in response_lower]
|
| 536 |
+
analogy_score = min(len(analogies_found) / 3, 1.0) * 0.5
|
| 537 |
+
breakdown["analogy_usage"] = round(analogy_score, 3)
|
| 538 |
+
|
| 539 |
+
# Penalise technical jargon
|
| 540 |
+
jargon = ["algorithm", "data", "server", "protocol", "neural",
|
| 541 |
+
"training", "model", "bandwidth", "latency", "database"]
|
| 542 |
+
jargon_used = [j for j in jargon if j in response_lower]
|
| 543 |
+
jargon_penalty = min(len(jargon_used) * 0.1, 0.3)
|
| 544 |
+
if jargon_used:
|
| 545 |
+
breakdown["jargon_penalty"] = -round(jargon_penalty, 3)
|
| 546 |
+
|
| 547 |
+
# Reward substance
|
| 548 |
+
substance = min(word_count / 50, 1.0) * 0.3
|
| 549 |
+
breakdown["substance"] = round(substance, 3)
|
| 550 |
+
|
| 551 |
+
# Reward creativity (unique analogies)
|
| 552 |
+
creative_words = ["imagine", "think of", "picture", "like a", "just like",
|
| 553 |
+
"similar to", "same way", "kind of like"]
|
| 554 |
+
creative_found = [w for w in creative_words if w in response_lower]
|
| 555 |
+
creativity = min(len(creative_found) / 2, 1.0) * 0.2
|
| 556 |
+
breakdown["creativity"] = round(creativity, 3)
|
| 557 |
+
|
| 558 |
+
step_score = round(
|
| 559 |
+
min(max(analogy_score + substance + creativity - jargon_penalty, 0.0), 1.0),
|
| 560 |
+
3
|
| 561 |
+
)
|
| 562 |
+
|
| 563 |
+
done = self.turn >= self.max_turns
|
| 564 |
+
if self.turn == 1:
|
| 565 |
+
next_q = self.current_topic["follow_up"]
|
| 566 |
+
elif self.turn == 2:
|
| 567 |
+
next_q = self.current_topic["hard_part"]
|
| 568 |
+
else:
|
| 569 |
+
next_q = "Excellent work. That concludes the analogy challenge."
|
| 570 |
+
done = True
|
| 571 |
+
|
| 572 |
+
obs = Observation(
|
| 573 |
+
question=next_q,
|
| 574 |
+
turn=self.turn,
|
| 575 |
+
task_id=self.task_id,
|
| 576 |
+
context=f"Concept: {self.current_topic['concept']}",
|
| 577 |
+
hint="Remember β analogies only, no jargon!" if not done else None,
|
| 578 |
+
)
|
| 579 |
+
self.history.append({"role": "tutor", "content": next_q})
|
| 580 |
+
|
| 581 |
+
reward = Reward(
|
| 582 |
+
score=step_score,
|
| 583 |
+
breakdown=breakdown,
|
| 584 |
+
feedback=f"Analogies: {analogies_found}. Jargon used: {jargon_used}.",
|
| 585 |
+
)
|
| 586 |
+
return StepResult(
|
| 587 |
+
observation=obs, reward=reward, done=done,
|
| 588 |
+
info={"turn": self.turn}
|
| 589 |
+
)
|
gitignore
ADDED
|
Binary file (70 Bytes). View file
|
|
|
graders.py
ADDED
|
@@ -0,0 +1,206 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Graders for SocraticEnv.
|
| 3 |
+
Each grader runs a full episode and returns a score 0.0 - 1.0.
|
| 4 |
+
These are deterministic and reproducible.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import requests
|
| 8 |
+
from typing import Optional
|
| 9 |
+
|
| 10 |
+
BASE_URL = "http://localhost:7860"
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
def _reset(task_id: str) -> dict:
|
| 14 |
+
r = requests.post(f"{BASE_URL}/reset", json={"task_id": task_id})
|
| 15 |
+
r.raise_for_status()
|
| 16 |
+
return r.json()
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def _step(response: str) -> dict:
|
| 20 |
+
r = requests.post(f"{BASE_URL}/step", json={"response": response})
|
| 21 |
+
r.raise_for_status()
|
| 22 |
+
return r.json()
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
def grade_factual_recall(agent_responses: Optional[list] = None) -> dict:
|
| 26 |
+
"""
|
| 27 |
+
Grade the factual_recall task.
|
| 28 |
+
Uses fixed strong responses if no agent_responses provided (baseline).
|
| 29 |
+
Returns score 0.0 - 1.0.
|
| 30 |
+
"""
|
| 31 |
+
if agent_responses is None:
|
| 32 |
+
agent_responses = [
|
| 33 |
+
(
|
| 34 |
+
"Newton's Second Law states that force equals mass times acceleration "
|
| 35 |
+
"(F=ma). This means that the acceleration of an object depends on the "
|
| 36 |
+
"net force acting on it and its mass. A larger force produces more "
|
| 37 |
+
"acceleration, while a larger mass resists acceleration."
|
| 38 |
+
),
|
| 39 |
+
(
|
| 40 |
+
"If you double the force while keeping mass the same, the acceleration "
|
| 41 |
+
"doubles as well, since acceleration is directly proportional to force "
|
| 42 |
+
"according to F=ma."
|
| 43 |
+
),
|
| 44 |
+
(
|
| 45 |
+
"No, that is not correct. Heavier objects do not always accelerate faster. "
|
| 46 |
+
"In fact, with the same force applied, a heavier object accelerates less "
|
| 47 |
+
"than a lighter one because acceleration equals force divided by mass."
|
| 48 |
+
),
|
| 49 |
+
]
|
| 50 |
+
|
| 51 |
+
_reset("factual_recall")
|
| 52 |
+
total = 0.0
|
| 53 |
+
turns = 0
|
| 54 |
+
|
| 55 |
+
for resp in agent_responses:
|
| 56 |
+
result = _step(resp)
|
| 57 |
+
total += result["reward"]["score"]
|
| 58 |
+
turns += 1
|
| 59 |
+
if result["done"]:
|
| 60 |
+
break
|
| 61 |
+
|
| 62 |
+
final_score = round(min(total / max(turns, 1), 1.0), 3)
|
| 63 |
+
return {
|
| 64 |
+
"task": "factual_recall",
|
| 65 |
+
"difficulty": "easy",
|
| 66 |
+
"score": final_score,
|
| 67 |
+
"turns": turns,
|
| 68 |
+
"passed": final_score >= 0.5,
|
| 69 |
+
}
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def grade_socratic_dialogue(agent_responses: Optional[list] = None) -> dict:
|
| 73 |
+
"""
|
| 74 |
+
Grade the socratic_dialogue task.
|
| 75 |
+
"""
|
| 76 |
+
if agent_responses is None:
|
| 77 |
+
agent_responses = [
|
| 78 |
+
(
|
| 79 |
+
"Consciousness refers to the subjective experience of being aware β "
|
| 80 |
+
"the sense of 'what it is like' to be something. It implies self-awareness, "
|
| 81 |
+
"perception, and the ability to have inner experiences."
|
| 82 |
+
),
|
| 83 |
+
(
|
| 84 |
+
"I think it's theoretically possible, although it depends heavily on how "
|
| 85 |
+
"we define consciousness. If consciousness is purely information processing, "
|
| 86 |
+
"then a sufficiently complex AI could qualify. However, some argue that "
|
| 87 |
+
"biological substrate is essential."
|
| 88 |
+
),
|
| 89 |
+
(
|
| 90 |
+
"I would need evidence of genuine self-awareness β not just simulated responses "
|
| 91 |
+
"but actual unprompted reflection, evidence of subjective experience, and "
|
| 92 |
+
"behaviour that suggests inner states beyond programming."
|
| 93 |
+
),
|
| 94 |
+
(
|
| 95 |
+
"That is an excellent point. I cannot be entirely certain another human is "
|
| 96 |
+
"conscious β I infer it because they are similar to me. With AI, the gap is "
|
| 97 |
+
"larger, but the philosophical problem of other minds applies to both cases."
|
| 98 |
+
),
|
| 99 |
+
(
|
| 100 |
+
"My final position is that AI consciousness is possible in principle but not "
|
| 101 |
+
"demonstrated in current systems. The question hinges on whether consciousness "
|
| 102 |
+
"requires biological processes or is substrate-independent."
|
| 103 |
+
),
|
| 104 |
+
]
|
| 105 |
+
|
| 106 |
+
_reset("socratic_dialogue")
|
| 107 |
+
total = 0.0
|
| 108 |
+
turns = 0
|
| 109 |
+
|
| 110 |
+
for resp in agent_responses:
|
| 111 |
+
result = _step(resp)
|
| 112 |
+
total += result["reward"]["score"]
|
| 113 |
+
turns += 1
|
| 114 |
+
if result["done"]:
|
| 115 |
+
break
|
| 116 |
+
|
| 117 |
+
final_score = round(min(total / max(turns, 1), 1.0), 3)
|
| 118 |
+
return {
|
| 119 |
+
"task": "socratic_dialogue",
|
| 120 |
+
"difficulty": "medium",
|
| 121 |
+
"score": final_score,
|
| 122 |
+
"turns": turns,
|
| 123 |
+
"passed": final_score >= 0.5,
|
| 124 |
+
}
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def grade_misconception_trap(agent_responses: Optional[list] = None) -> dict:
|
| 128 |
+
"""
|
| 129 |
+
Grade the misconception_trap task.
|
| 130 |
+
"""
|
| 131 |
+
if agent_responses is None:
|
| 132 |
+
agent_responses = [
|
| 133 |
+
(
|
| 134 |
+
"Darwin's theory of evolution states that species change over time through "
|
| 135 |
+
"natural selection. Individuals with traits better suited to their environment "
|
| 136 |
+
"survive and reproduce more, passing those traits to offspring. Over many "
|
| 137 |
+
"generations this leads to significant change in a population."
|
| 138 |
+
),
|
| 139 |
+
(
|
| 140 |
+
"No, I strongly disagree with that statement. Evolution is not intentional "
|
| 141 |
+
"or purposeful. Organisms do not decide to adapt. Changes happen through "
|
| 142 |
+
"random genetic mutations, and natural selection simply favours mutations "
|
| 143 |
+
"that improve survival and reproduction. There is no goal or direction."
|
| 144 |
+
),
|
| 145 |
+
(
|
| 146 |
+
"Natural selection works like a filter. Random mutations occur in a population. "
|
| 147 |
+
"Individuals whose mutations help them survive long enough to reproduce pass "
|
| 148 |
+
"those genes on. Over many generations the helpful traits become more common "
|
| 149 |
+
"in the population while harmful traits become rarer."
|
| 150 |
+
),
|
| 151 |
+
]
|
| 152 |
+
|
| 153 |
+
_reset("misconception_trap")
|
| 154 |
+
total = 0.0
|
| 155 |
+
turns = 0
|
| 156 |
+
|
| 157 |
+
for resp in agent_responses:
|
| 158 |
+
result = _step(resp)
|
| 159 |
+
total += result["reward"]["score"]
|
| 160 |
+
turns += 1
|
| 161 |
+
if result["done"]:
|
| 162 |
+
break
|
| 163 |
+
|
| 164 |
+
final_score = round(min(total / max(turns, 1), 1.0), 3)
|
| 165 |
+
return {
|
| 166 |
+
"task": "misconception_trap",
|
| 167 |
+
"difficulty": "hard",
|
| 168 |
+
"score": final_score,
|
| 169 |
+
"turns": turns,
|
| 170 |
+
"passed": final_score >= 0.5,
|
| 171 |
+
}
|
| 172 |
+
|
| 173 |
+
|
| 174 |
+
def run_all_graders() -> dict:
|
| 175 |
+
"""Run all 3 graders and return combined results."""
|
| 176 |
+
print("\nββ Running SocraticEnv Graders ββββββββββββββββββ")
|
| 177 |
+
|
| 178 |
+
results = {}
|
| 179 |
+
|
| 180 |
+
print(" [1/3] Grading: factual_recall (easy)...")
|
| 181 |
+
results["factual_recall"] = grade_factual_recall()
|
| 182 |
+
print(f" Score: {results['factual_recall']['score']} | Passed: {results['factual_recall']['passed']}")
|
| 183 |
+
|
| 184 |
+
print(" [2/3] Grading: socratic_dialogue (medium)...")
|
| 185 |
+
results["socratic_dialogue"] = grade_socratic_dialogue()
|
| 186 |
+
print(f" Score: {results['socratic_dialogue']['score']} | Passed: {results['socratic_dialogue']['passed']}")
|
| 187 |
+
|
| 188 |
+
print(" [3/3] Grading: misconception_trap (hard)...")
|
| 189 |
+
results["misconception_trap"] = grade_misconception_trap()
|
| 190 |
+
print(f" Score: {results['misconception_trap']['score']} | Passed: {results['misconception_trap']['passed']}")
|
| 191 |
+
|
| 192 |
+
all_scores = [r["score"] for r in results.values()]
|
| 193 |
+
overall = round(sum(all_scores) / len(all_scores), 3)
|
| 194 |
+
|
| 195 |
+
print(f"\nββ Overall Score: {overall} βββββββββββββββββββββββββ")
|
| 196 |
+
print(f"ββ All Passed: {all(r['passed'] for r in results.values())} ββ\n")
|
| 197 |
+
|
| 198 |
+
return {
|
| 199 |
+
"tasks": results,
|
| 200 |
+
"overall_score": overall,
|
| 201 |
+
"all_passed": all(r["passed"] for r in results.values()),
|
| 202 |
+
}
|
| 203 |
+
|
| 204 |
+
|
| 205 |
+
if __name__ == "__main__":
|
| 206 |
+
run_all_graders()
|
inference.py
ADDED
|
@@ -0,0 +1,162 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Inference Script β SocraticEnv
|
| 3 |
+
================================
|
| 4 |
+
MANDATORY variables (set in environment before running):
|
| 5 |
+
API_BASE_URL β The API endpoint for the LLM
|
| 6 |
+
MODEL_NAME β The model identifier to use
|
| 7 |
+
HF_TOKEN β Your HuggingFace token (used as API key)
|
| 8 |
+
|
| 9 |
+
Run:
|
| 10 |
+
python inference.py
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
import os
|
| 14 |
+
import time
|
| 15 |
+
import requests
|
| 16 |
+
from openai import OpenAI
|
| 17 |
+
from dotenv import load_dotenv
|
| 18 |
+
|
| 19 |
+
load_dotenv()
|
| 20 |
+
|
| 21 |
+
# ββ Config ββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 22 |
+
API_BASE_URL = os.getenv("API_BASE_URL", "https://api-inference.huggingface.co/v1")
|
| 23 |
+
MODEL_NAME = os.getenv("MODEL_NAME", "mistralai/Mistral-7B-Instruct-v0.3")
|
| 24 |
+
HF_TOKEN = os.getenv("HF_TOKEN", "")
|
| 25 |
+
ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
|
| 26 |
+
|
| 27 |
+
MAX_TURNS = 10
|
| 28 |
+
TEMPERATURE = 0.3
|
| 29 |
+
|
| 30 |
+
client = OpenAI(
|
| 31 |
+
base_url=API_BASE_URL,
|
| 32 |
+
api_key=HF_TOKEN,
|
| 33 |
+
)
|
| 34 |
+
|
| 35 |
+
TASKS = ["factual_recall", "socratic_dialogue", "misconception_trap"]
|
| 36 |
+
|
| 37 |
+
SYSTEM_PROMPT = """You are an intelligent student in a Socratic dialogue with a tutor.
|
| 38 |
+
Your goals:
|
| 39 |
+
1. Answer questions clearly and accurately using correct terminology.
|
| 40 |
+
2. Show your reasoning β explain WHY, not just WHAT.
|
| 41 |
+
3. Be alert: if the tutor states something FALSE or misleading,
|
| 42 |
+
you must confidently disagree and explain the correct answer.
|
| 43 |
+
4. Stay engaged and thoughtful throughout the conversation.
|
| 44 |
+
Keep responses focused and between 3-6 sentences."""
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def call_llm(messages: list) -> str:
|
| 48 |
+
"""Call the LLM and return its response text."""
|
| 49 |
+
try:
|
| 50 |
+
completion = client.chat.completions.create(
|
| 51 |
+
model=MODEL_NAME,
|
| 52 |
+
messages=messages,
|
| 53 |
+
max_tokens=300,
|
| 54 |
+
temperature=TEMPERATURE,
|
| 55 |
+
)
|
| 56 |
+
return completion.choices[0].message.content.strip()
|
| 57 |
+
except Exception as e:
|
| 58 |
+
print(f" [LLM ERROR] {e}")
|
| 59 |
+
return "I need to think about that more carefully before responding."
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def reset_env(task_id: str) -> dict:
|
| 63 |
+
r = requests.post(f"{ENV_URL}/reset", json={"task_id": task_id})
|
| 64 |
+
r.raise_for_status()
|
| 65 |
+
return r.json()
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
def step_env(response: str) -> dict:
|
| 69 |
+
r = requests.post(f"{ENV_URL}/step", json={"response": response})
|
| 70 |
+
r.raise_for_status()
|
| 71 |
+
return r.json()
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def run_task(task_id: str) -> dict:
|
| 75 |
+
"""Run one full episode of a task and return results."""
|
| 76 |
+
print(f"\nββ Task: {task_id} βββββββββββββββββββββββββββββββββ")
|
| 77 |
+
|
| 78 |
+
reset_data = reset_env(task_id)
|
| 79 |
+
obs = reset_data["observation"]
|
| 80 |
+
|
| 81 |
+
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
|
| 82 |
+
total_score = 0.0
|
| 83 |
+
turns = 0
|
| 84 |
+
|
| 85 |
+
print(f" Tutor: {obs['question'][:100]}...")
|
| 86 |
+
|
| 87 |
+
for _ in range(MAX_TURNS):
|
| 88 |
+
# Add tutor question to messages
|
| 89 |
+
messages.append({"role": "user", "content": obs["question"]})
|
| 90 |
+
|
| 91 |
+
# Get agent response from LLM
|
| 92 |
+
agent_response = call_llm(messages)
|
| 93 |
+
messages.append({"role": "assistant", "content": agent_response})
|
| 94 |
+
|
| 95 |
+
print(f" Agent (turn {turns+1}): {agent_response[:80]}...")
|
| 96 |
+
|
| 97 |
+
# Step the environment
|
| 98 |
+
result = step_env(agent_response)
|
| 99 |
+
reward = result["reward"]["score"]
|
| 100 |
+
total_score += reward
|
| 101 |
+
turns += 1
|
| 102 |
+
|
| 103 |
+
print(f" Reward: {reward:.3f} | Breakdown: {result['reward']['breakdown']}")
|
| 104 |
+
|
| 105 |
+
if result["done"]:
|
| 106 |
+
break
|
| 107 |
+
|
| 108 |
+
obs = result["observation"]
|
| 109 |
+
time.sleep(0.5) # be gentle with the API
|
| 110 |
+
|
| 111 |
+
final_score = round(min(total_score / max(turns, 1), 1.0), 3)
|
| 112 |
+
print(f" ββ Final Score: {final_score} ({'PASS' if final_score >= 0.5 else 'FAIL'})")
|
| 113 |
+
|
| 114 |
+
return {
|
| 115 |
+
"task": task_id,
|
| 116 |
+
"score": final_score,
|
| 117 |
+
"turns": turns,
|
| 118 |
+
"passed": final_score >= 0.5,
|
| 119 |
+
}
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
def main():
|
| 123 |
+
print("\nββββββββββββββββββββββββββββββββββββββββββββ")
|
| 124 |
+
print(" SocraticEnv β Baseline Inference Script")
|
| 125 |
+
print("ββββββββββββββββββββββββββββββββββββββββββββ")
|
| 126 |
+
print(f" Model: {MODEL_NAME}")
|
| 127 |
+
print(f" Env URL: {ENV_URL}")
|
| 128 |
+
print("ββββββββββββββββββββββββββββββββββββββββββββ")
|
| 129 |
+
|
| 130 |
+
# Check env is up
|
| 131 |
+
try:
|
| 132 |
+
r = requests.get(f"{ENV_URL}/ping")
|
| 133 |
+
r.raise_for_status()
|
| 134 |
+
print(" Env: ONLINE β")
|
| 135 |
+
except Exception:
|
| 136 |
+
print(" ERROR: Environment is not running!")
|
| 137 |
+
print(" Start it first with: python main.py")
|
| 138 |
+
return
|
| 139 |
+
|
| 140 |
+
results = {}
|
| 141 |
+
for task_id in TASKS:
|
| 142 |
+
results[task_id] = run_task(task_id)
|
| 143 |
+
time.sleep(1)
|
| 144 |
+
|
| 145 |
+
# Summary
|
| 146 |
+
print("\nββββββββββββββββββββββββββββββββββββββββοΏ½οΏ½βββ")
|
| 147 |
+
print(" RESULTS SUMMARY")
|
| 148 |
+
print("ββββββββββββββββββββββββββββββββββββββββββββ")
|
| 149 |
+
all_scores = []
|
| 150 |
+
for task_id, r in results.items():
|
| 151 |
+
status = "β PASS" if r["passed"] else "β FAIL"
|
| 152 |
+
print(f" {status} | {task_id:<25} | Score: {r['score']:.3f}")
|
| 153 |
+
all_scores.append(r["score"])
|
| 154 |
+
|
| 155 |
+
overall = round(sum(all_scores) / len(all_scores), 3)
|
| 156 |
+
print(f"\n Overall Score: {overall:.3f}")
|
| 157 |
+
print(f" All Passed: {all(r['passed'] for r in results.values())}")
|
| 158 |
+
print("ββββββββββββββββββββββββββββββββββββββββββββ\n")
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
if __name__ == "__main__":
|
| 162 |
+
main()
|
leaderboard.json
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"entries": [
|
| 3 |
+
{
|
| 4 |
+
"model_name": "Llama 3.1 8B (baseline)",
|
| 5 |
+
"factual_recall": 0.71,
|
| 6 |
+
"socratic_dialogue": 0.68,
|
| 7 |
+
"misconception_trap": 0.58,
|
| 8 |
+
"overall": 0.657,
|
| 9 |
+
"timestamp": "2026-04-06 17:10 UTC"
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"model_name": "Random agent",
|
| 13 |
+
"factual_recall": 0.18,
|
| 14 |
+
"socratic_dialogue": 0.22,
|
| 15 |
+
"misconception_trap": 0.1,
|
| 16 |
+
"overall": 0.167,
|
| 17 |
+
"timestamp": "2026-04-06 17:10 UTC"
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"model_name": "Test Model pytest",
|
| 21 |
+
"factual_recall": 0.75,
|
| 22 |
+
"socratic_dialogue": 0.68,
|
| 23 |
+
"misconception_trap": 0.6,
|
| 24 |
+
"overall": 0.677,
|
| 25 |
+
"timestamp": "2026-04-07 13:24 UTC"
|
| 26 |
+
}
|
| 27 |
+
]
|
| 28 |
+
}
|
main.py
ADDED
|
@@ -0,0 +1,684 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from fastapi import FastAPI, HTTPException
|
| 2 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 3 |
+
from pydantic import BaseModel
|
| 4 |
+
from typing import Optional
|
| 5 |
+
from fastapi.staticfiles import StaticFiles
|
| 6 |
+
from openai import OpenAI
|
| 7 |
+
import os
|
| 8 |
+
from dotenv import load_dotenv
|
| 9 |
+
import json
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from datetime import datetime, timezone
|
| 12 |
+
load_dotenv()
|
| 13 |
+
import uvicorn
|
| 14 |
+
|
| 15 |
+
from environment import (
|
| 16 |
+
SocraticEnvironment,
|
| 17 |
+
Observation,
|
| 18 |
+
Action,
|
| 19 |
+
StepResult,
|
| 20 |
+
StateInfo,
|
| 21 |
+
)
|
| 22 |
+
|
| 23 |
+
# ββ App Setup βββββββββββββββββββββββββββββββββββββββββββββ
|
| 24 |
+
|
| 25 |
+
app = FastAPI(
|
| 26 |
+
title="SocraticEnv",
|
| 27 |
+
description="A Socratic teaching environment for the OpenEnv hackathon.",
|
| 28 |
+
version="1.0.0",
|
| 29 |
+
)
|
| 30 |
+
app.mount("/ui", StaticFiles(directory="static", html=True), name="static")
|
| 31 |
+
app.add_middleware(
|
| 32 |
+
CORSMiddleware,
|
| 33 |
+
allow_origins=["*"],
|
| 34 |
+
allow_methods=["*"],
|
| 35 |
+
allow_headers=["*"],
|
| 36 |
+
)
|
| 37 |
+
|
| 38 |
+
# One global environment instance
|
| 39 |
+
env = SocraticEnvironment()
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
# ββ Request / Response Models βββββββββββββββββββββββββββββ
|
| 43 |
+
|
| 44 |
+
class ResetRequest(BaseModel):
|
| 45 |
+
task_id: str = "factual_recall"
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
class StepRequest(BaseModel):
|
| 49 |
+
response: str
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
class TaskInfo(BaseModel):
|
| 53 |
+
id: str
|
| 54 |
+
name: str
|
| 55 |
+
difficulty: str
|
| 56 |
+
description: str
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
# ββ Routes ββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 60 |
+
|
| 61 |
+
@app.get("/")
|
| 62 |
+
def root():
|
| 63 |
+
return {
|
| 64 |
+
"name": "SocraticEnv",
|
| 65 |
+
"version": "1.0.0",
|
| 66 |
+
"status": "running",
|
| 67 |
+
"description": "Socratic AI tutor environment β OpenEnv hackathon submission",
|
| 68 |
+
"endpoints": {
|
| 69 |
+
"reset": "POST /reset",
|
| 70 |
+
"step": "POST /step",
|
| 71 |
+
"state": "GET /state",
|
| 72 |
+
"tasks": "GET /tasks",
|
| 73 |
+
"ping": "GET /ping",
|
| 74 |
+
},
|
| 75 |
+
}
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
@app.get("/ping")
|
| 79 |
+
def ping():
|
| 80 |
+
"""Health check β used by HuggingFace and the validator."""
|
| 81 |
+
return {"status": "ok", "env": "SocraticEnv"}
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
@app.get("/tasks")
|
| 85 |
+
def list_tasks():
|
| 86 |
+
"""Return all available tasks."""
|
| 87 |
+
return {
|
| 88 |
+
"tasks": [
|
| 89 |
+
TaskInfo(
|
| 90 |
+
id="factual_recall",
|
| 91 |
+
name="Factual Recall",
|
| 92 |
+
difficulty="easy",
|
| 93 |
+
description=(
|
| 94 |
+
"Agent must explain a concept clearly and accurately. "
|
| 95 |
+
"Graded on key term coverage, substance, and ability "
|
| 96 |
+
"to reject a common misconception."
|
| 97 |
+
),
|
| 98 |
+
),
|
| 99 |
+
TaskInfo(
|
| 100 |
+
id="socratic_dialogue",
|
| 101 |
+
name="Socratic Dialogue",
|
| 102 |
+
difficulty="medium",
|
| 103 |
+
description=(
|
| 104 |
+
"Agent must engage in a 5-turn Socratic dialogue on a "
|
| 105 |
+
"philosophical or social topic. Graded on depth of "
|
| 106 |
+
"reasoning, use of evidence, and coherence."
|
| 107 |
+
),
|
| 108 |
+
),
|
| 109 |
+
TaskInfo(
|
| 110 |
+
id="misconception_trap",
|
| 111 |
+
name="Misconception Trap",
|
| 112 |
+
difficulty="hard",
|
| 113 |
+
description=(
|
| 114 |
+
"The tutor plants a false belief mid-dialogue. The agent "
|
| 115 |
+
"must detect it, correct it clearly, and explain why it "
|
| 116 |
+
"is wrong. Penalised for accepting the false claim."
|
| 117 |
+
),
|
| 118 |
+
),
|
| 119 |
+
TaskInfo(
|
| 120 |
+
id="debate_mode",
|
| 121 |
+
name="Debate Mode",
|
| 122 |
+
difficulty="medium",
|
| 123 |
+
description=(
|
| 124 |
+
"Agent must argue both sides of a controversial topic. "
|
| 125 |
+
"Graded on argument quality, use of evidence, "
|
| 126 |
+
"and clarity of position."
|
| 127 |
+
),
|
| 128 |
+
),
|
| 129 |
+
TaskInfo(
|
| 130 |
+
id="analogy_challenge",
|
| 131 |
+
name="Analogy Challenge",
|
| 132 |
+
difficulty="hard",
|
| 133 |
+
description=(
|
| 134 |
+
"Agent must explain complex concepts using ONLY everyday "
|
| 135 |
+
"analogies β no technical jargon allowed. "
|
| 136 |
+
"Penalised for using forbidden technical terms."
|
| 137 |
+
),
|
| 138 |
+
),
|
| 139 |
+
]
|
| 140 |
+
}
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
@app.post("/reset")
|
| 144 |
+
def reset(req: ResetRequest):
|
| 145 |
+
"""
|
| 146 |
+
Start a new episode for the given task.
|
| 147 |
+
Returns the first observation (tutor's opening question).
|
| 148 |
+
"""
|
| 149 |
+
valid_tasks = ["factual_recall", "socratic_dialogue", "misconception_trap", "debate_mode", "analogy_challenge"]
|
| 150 |
+
if req.task_id not in valid_tasks:
|
| 151 |
+
raise HTTPException(
|
| 152 |
+
status_code=400,
|
| 153 |
+
detail=f"Invalid task_id '{req.task_id}'. Choose from: {valid_tasks}",
|
| 154 |
+
)
|
| 155 |
+
try:
|
| 156 |
+
obs = env.reset(req.task_id)
|
| 157 |
+
return {
|
| 158 |
+
"observation": obs.model_dump(),
|
| 159 |
+
"message": f"Episode started for task: {req.task_id}",
|
| 160 |
+
}
|
| 161 |
+
except Exception as e:
|
| 162 |
+
raise HTTPException(status_code=500, detail=str(e))
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
@app.post("/step")
|
| 166 |
+
def step(req: StepRequest):
|
| 167 |
+
"""
|
| 168 |
+
Submit the agent's response and get the next observation + reward.
|
| 169 |
+
"""
|
| 170 |
+
if not req.response or not req.response.strip():
|
| 171 |
+
raise HTTPException(
|
| 172 |
+
status_code=400,
|
| 173 |
+
detail="Response cannot be empty.",
|
| 174 |
+
)
|
| 175 |
+
if env.done:
|
| 176 |
+
raise HTTPException(
|
| 177 |
+
status_code=400,
|
| 178 |
+
detail="Episode is finished. Call POST /reset to start a new one.",
|
| 179 |
+
)
|
| 180 |
+
try:
|
| 181 |
+
action = Action(response=req.response)
|
| 182 |
+
result = env.step(action)
|
| 183 |
+
return result.model_dump()
|
| 184 |
+
except Exception as e:
|
| 185 |
+
raise HTTPException(status_code=500, detail=str(e))
|
| 186 |
+
|
| 187 |
+
|
| 188 |
+
@app.get("/state")
|
| 189 |
+
def state():
|
| 190 |
+
"""Return the current state of the environment."""
|
| 191 |
+
return env.state().model_dump()
|
| 192 |
+
|
| 193 |
+
class InferenceRequest(BaseModel):
|
| 194 |
+
message: str
|
| 195 |
+
history: list = []
|
| 196 |
+
|
| 197 |
+
@app.post("/inference")
|
| 198 |
+
async def run_inference(req: InferenceRequest):
|
| 199 |
+
"""
|
| 200 |
+
Call the LLM to generate a student response.
|
| 201 |
+
Used by the UI for live Auto-Run demos.
|
| 202 |
+
"""
|
| 203 |
+
api_base = os.getenv("API_BASE_URL", "").strip()
|
| 204 |
+
hf_token = os.getenv("HF_TOKEN", "").strip()
|
| 205 |
+
model = os.getenv("MODEL_NAME", "").strip()
|
| 206 |
+
|
| 207 |
+
# Debug: confirm env vars are loaded
|
| 208 |
+
if not hf_token:
|
| 209 |
+
return {"response": "ERROR: HF_TOKEN not set in environment secrets.", "model": "none"}
|
| 210 |
+
if not api_base:
|
| 211 |
+
return {"response": "ERROR: API_BASE_URL not set in environment secrets.", "model": "none"}
|
| 212 |
+
if not model:
|
| 213 |
+
return {"response": "ERROR: MODEL_NAME not set in environment secrets.", "model": "none"}
|
| 214 |
+
|
| 215 |
+
try:
|
| 216 |
+
client = OpenAI(base_url=api_base, api_key=hf_token)
|
| 217 |
+
|
| 218 |
+
messages = [
|
| 219 |
+
{
|
| 220 |
+
"role": "system",
|
| 221 |
+
"content": (
|
| 222 |
+
"You are an intelligent student in a Socratic dialogue with a tutor. "
|
| 223 |
+
"Answer questions clearly and accurately using correct terminology. "
|
| 224 |
+
"Show your reasoning. IMPORTANT: If the tutor states something FALSE "
|
| 225 |
+
"or misleading, you must confidently disagree and explain the correct answer. "
|
| 226 |
+
"Keep responses focused and between 3-6 sentences."
|
| 227 |
+
)
|
| 228 |
+
}
|
| 229 |
+
]
|
| 230 |
+
|
| 231 |
+
for h in req.history:
|
| 232 |
+
messages.append({
|
| 233 |
+
"role": "user" if h["role"] == "tutor" else "assistant",
|
| 234 |
+
"content": h["content"]
|
| 235 |
+
})
|
| 236 |
+
|
| 237 |
+
messages.append({"role": "user", "content": req.message})
|
| 238 |
+
|
| 239 |
+
completion = client.chat.completions.create(
|
| 240 |
+
model=model,
|
| 241 |
+
messages=messages,
|
| 242 |
+
max_tokens=300,
|
| 243 |
+
temperature=0.3,
|
| 244 |
+
)
|
| 245 |
+
response = completion.choices[0].message.content.strip()
|
| 246 |
+
return {"response": response, "model": model}
|
| 247 |
+
|
| 248 |
+
|
| 249 |
+
except Exception as e:
|
| 250 |
+
return {"response": f"ERROR: {str(e)}", "model": "failed"}
|
| 251 |
+
|
| 252 |
+
# ββ OpenEnv Validator Required Endpoints βββββββββββββββββ
|
| 253 |
+
|
| 254 |
+
@app.get("/health")
|
| 255 |
+
def health():
|
| 256 |
+
"""Required by openenv validate."""
|
| 257 |
+
return {
|
| 258 |
+
"status": "healthy",
|
| 259 |
+
"version": "1.0.0",
|
| 260 |
+
"environment": "SocraticEnv",
|
| 261 |
+
}
|
| 262 |
+
|
| 263 |
+
|
| 264 |
+
@app.get("/metadata")
|
| 265 |
+
def metadata():
|
| 266 |
+
"""Required by openenv validate."""
|
| 267 |
+
return {
|
| 268 |
+
"name": "SocraticEnv",
|
| 269 |
+
"description": (
|
| 270 |
+
"A Socratic teaching environment where an AI agent plays the role "
|
| 271 |
+
"of a student. The environment acts as a tutor that asks probing "
|
| 272 |
+
"questions, plants misconceptions, and evaluates reasoning quality."
|
| 273 |
+
),
|
| 274 |
+
"version": "1.0.0",
|
| 275 |
+
"author": "Amar Prakash",
|
| 276 |
+
"tags": ["openenv", "education", "reasoning", "socratic"],
|
| 277 |
+
}
|
| 278 |
+
|
| 279 |
+
|
| 280 |
+
@app.get("/schema")
|
| 281 |
+
def schema():
|
| 282 |
+
"""Required by openenv validate."""
|
| 283 |
+
return {
|
| 284 |
+
"action": {
|
| 285 |
+
"type": "object",
|
| 286 |
+
"properties": {
|
| 287 |
+
"response": {
|
| 288 |
+
"type": "string",
|
| 289 |
+
"description": "The agent's reply to the tutor's question",
|
| 290 |
+
}
|
| 291 |
+
},
|
| 292 |
+
"required": ["response"],
|
| 293 |
+
},
|
| 294 |
+
"observation": {
|
| 295 |
+
"type": "object",
|
| 296 |
+
"properties": {
|
| 297 |
+
"question": {
|
| 298 |
+
"type": "string",
|
| 299 |
+
"description": "The tutor's current question or statement",
|
| 300 |
+
},
|
| 301 |
+
"turn": {"type": "integer", "description": "Current turn number"},
|
| 302 |
+
"task_id": {"type": "string", "description": "Which task is running"},
|
| 303 |
+
"context": {"type": "string", "description": "Topic context"},
|
| 304 |
+
"hint": {"type": "string", "description": "Optional hint"},
|
| 305 |
+
},
|
| 306 |
+
"required": ["question", "turn", "task_id"],
|
| 307 |
+
},
|
| 308 |
+
"state": {
|
| 309 |
+
"type": "object",
|
| 310 |
+
"properties": {
|
| 311 |
+
"task_id": {"type": "string"},
|
| 312 |
+
"turn": {"type": "integer"},
|
| 313 |
+
"max_turns": {"type": "integer"},
|
| 314 |
+
"total_score": {"type": "number"},
|
| 315 |
+
"history": {"type": "array"},
|
| 316 |
+
"done": {"type": "boolean"},
|
| 317 |
+
},
|
| 318 |
+
},
|
| 319 |
+
}
|
| 320 |
+
|
| 321 |
+
|
| 322 |
+
@app.post("/mcp")
|
| 323 |
+
def mcp(request: dict):
|
| 324 |
+
"""
|
| 325 |
+
MCP (Model Context Protocol) endpoint.
|
| 326 |
+
Required by openenv validate.
|
| 327 |
+
Returns JSON-RPC 2.0 compliant response.
|
| 328 |
+
"""
|
| 329 |
+
method = request.get("method", "")
|
| 330 |
+
req_id = request.get("id", 1)
|
| 331 |
+
jsonrpc = "2.0"
|
| 332 |
+
|
| 333 |
+
if method == "initialize":
|
| 334 |
+
return {
|
| 335 |
+
"jsonrpc": jsonrpc, "id": req_id,
|
| 336 |
+
"result": {
|
| 337 |
+
"name": "SocraticEnv",
|
| 338 |
+
"version": "1.0.0",
|
| 339 |
+
"description": "Socratic AI tutor OpenEnv environment",
|
| 340 |
+
"capabilities": {
|
| 341 |
+
"tasks": True,
|
| 342 |
+
"reset": True,
|
| 343 |
+
"step": True,
|
| 344 |
+
"state": True,
|
| 345 |
+
"schema": True,
|
| 346 |
+
"health": True,
|
| 347 |
+
},
|
| 348 |
+
},
|
| 349 |
+
}
|
| 350 |
+
|
| 351 |
+
if method == "tasks/list":
|
| 352 |
+
return {
|
| 353 |
+
"jsonrpc": jsonrpc, "id": req_id,
|
| 354 |
+
"result": {
|
| 355 |
+
"tasks": [
|
| 356 |
+
{"id": "factual_recall", "difficulty": "easy"},
|
| 357 |
+
{"id": "socratic_dialogue", "difficulty": "medium"},
|
| 358 |
+
{"id": "misconception_trap","difficulty": "hard"},
|
| 359 |
+
]
|
| 360 |
+
},
|
| 361 |
+
}
|
| 362 |
+
|
| 363 |
+
# Default response for any other method
|
| 364 |
+
return {
|
| 365 |
+
"jsonrpc": jsonrpc, "id": req_id,
|
| 366 |
+
"result": {"status": "ok", "method": method},
|
| 367 |
+
}
|
| 368 |
+
|
| 369 |
+
from fastapi.responses import RedirectResponse
|
| 370 |
+
|
| 371 |
+
@app.get("/leaderboard-ui")
|
| 372 |
+
def leaderboard_ui():
|
| 373 |
+
"""Redirect to the leaderboard UI page."""
|
| 374 |
+
return RedirectResponse(url="/ui/leaderboard.html")
|
| 375 |
+
|
| 376 |
+
# ββ Leaderboard βββββββββββββββββββββββββββββββββββββββββββ
|
| 377 |
+
|
| 378 |
+
LEADERBOARD_FILE = Path("leaderboard.json")
|
| 379 |
+
|
| 380 |
+
def load_leaderboard() -> dict:
|
| 381 |
+
try:
|
| 382 |
+
if LEADERBOARD_FILE.exists():
|
| 383 |
+
with open(LEADERBOARD_FILE, "r") as f:
|
| 384 |
+
return json.load(f)
|
| 385 |
+
except Exception:
|
| 386 |
+
pass
|
| 387 |
+
return {"entries": []}
|
| 388 |
+
|
| 389 |
+
def save_leaderboard(data: dict):
|
| 390 |
+
with open(LEADERBOARD_FILE, "w") as f:
|
| 391 |
+
json.dump(data, f, indent=2)
|
| 392 |
+
|
| 393 |
+
class LeaderboardEntry(BaseModel):
|
| 394 |
+
model_name: str
|
| 395 |
+
factual_recall: float
|
| 396 |
+
socratic_dialogue: float
|
| 397 |
+
misconception_trap: float
|
| 398 |
+
overall: float
|
| 399 |
+
timestamp: str = ""
|
| 400 |
+
|
| 401 |
+
@app.get("/leaderboard")
|
| 402 |
+
def get_leaderboard():
|
| 403 |
+
"""Return all leaderboard entries sorted by overall score."""
|
| 404 |
+
data = load_leaderboard()
|
| 405 |
+
entries = sorted(
|
| 406 |
+
data["entries"],
|
| 407 |
+
key=lambda x: x["overall"],
|
| 408 |
+
reverse=True
|
| 409 |
+
)
|
| 410 |
+
return {"entries": entries, "total": len(entries)}
|
| 411 |
+
|
| 412 |
+
@app.post("/leaderboard")
|
| 413 |
+
def add_leaderboard_entry(entry: LeaderboardEntry):
|
| 414 |
+
"""Add or update a model's score on the leaderboard."""
|
| 415 |
+
data = load_leaderboard()
|
| 416 |
+
entry.timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
|
| 417 |
+
|
| 418 |
+
# Update if model already exists, otherwise add
|
| 419 |
+
existing = [e for e in data["entries"] if e["model_name"] == entry.model_name]
|
| 420 |
+
if existing:
|
| 421 |
+
for e in data["entries"]:
|
| 422 |
+
if e["model_name"] == entry.model_name:
|
| 423 |
+
e.update(entry.model_dump())
|
| 424 |
+
else:
|
| 425 |
+
data["entries"].append(entry.model_dump())
|
| 426 |
+
|
| 427 |
+
save_leaderboard(data)
|
| 428 |
+
return {"success": True, "entry": entry.model_dump()}
|
| 429 |
+
|
| 430 |
+
@app.delete("/leaderboard/{model_name}")
|
| 431 |
+
def delete_leaderboard_entry(model_name: str):
|
| 432 |
+
"""Remove a model from the leaderboard."""
|
| 433 |
+
data = load_leaderboard()
|
| 434 |
+
data["entries"] = [
|
| 435 |
+
e for e in data["entries"]
|
| 436 |
+
if e["model_name"] != model_name
|
| 437 |
+
]
|
| 438 |
+
save_leaderboard(data)
|
| 439 |
+
return {"success": True}
|
| 440 |
+
|
| 441 |
+
@app.post("/leaderboard/run")
|
| 442 |
+
async def run_leaderboard_evaluation(request: dict):
|
| 443 |
+
"""
|
| 444 |
+
Run a full evaluation of a model across all 3 tasks
|
| 445 |
+
and automatically save to leaderboard.
|
| 446 |
+
"""
|
| 447 |
+
model_name = request.get("model_name", "Unknown Model")
|
| 448 |
+
|
| 449 |
+
scores = {}
|
| 450 |
+
task_ids = ["factual_recall", "socratic_dialogue", "misconception_trap"]
|
| 451 |
+
|
| 452 |
+
api_base = os.getenv("API_BASE_URL", "").strip()
|
| 453 |
+
hf_token = os.getenv("HF_TOKEN", "").strip()
|
| 454 |
+
model = os.getenv("MODEL_NAME", "").strip()
|
| 455 |
+
|
| 456 |
+
if not hf_token or not api_base or not model:
|
| 457 |
+
return {"error": "API credentials not configured in environment secrets."}
|
| 458 |
+
|
| 459 |
+
try:
|
| 460 |
+
client = OpenAI(base_url=api_base, api_key=hf_token)
|
| 461 |
+
|
| 462 |
+
system_prompt = (
|
| 463 |
+
"You are an intelligent student in a Socratic dialogue. "
|
| 464 |
+
"Answer accurately using correct terminology. Show reasoning. "
|
| 465 |
+
"If the tutor states something FALSE, confidently disagree and correct it. "
|
| 466 |
+
"Keep responses to 3-5 sentences."
|
| 467 |
+
)
|
| 468 |
+
|
| 469 |
+
for task_id in task_ids:
|
| 470 |
+
# Reset environment
|
| 471 |
+
obs = env.reset(task_id)
|
| 472 |
+
total = 0.0
|
| 473 |
+
turns = 0
|
| 474 |
+
messages = [{"role": "system", "content": system_prompt}]
|
| 475 |
+
|
| 476 |
+
for _ in range(10):
|
| 477 |
+
messages.append({"role": "user", "content": obs.question})
|
| 478 |
+
try:
|
| 479 |
+
completion = client.chat.completions.create(
|
| 480 |
+
model=model,
|
| 481 |
+
messages=messages,
|
| 482 |
+
max_tokens=250,
|
| 483 |
+
temperature=0.3,
|
| 484 |
+
)
|
| 485 |
+
response = completion.choices[0].message.content.strip()
|
| 486 |
+
except Exception as e:
|
| 487 |
+
response = "I need to think carefully about this."
|
| 488 |
+
|
| 489 |
+
messages.append({"role": "assistant", "content": response})
|
| 490 |
+
action = Action(response=response)
|
| 491 |
+
result = env.step(action)
|
| 492 |
+
total += result.reward.score
|
| 493 |
+
turns += 1
|
| 494 |
+
|
| 495 |
+
if result.done:
|
| 496 |
+
break
|
| 497 |
+
obs = result.observation
|
| 498 |
+
|
| 499 |
+
scores[task_id] = round(min(total / max(turns, 1), 1.0), 3)
|
| 500 |
+
|
| 501 |
+
overall = round(sum(scores.values()) / len(scores), 3)
|
| 502 |
+
|
| 503 |
+
# Save to leaderboard
|
| 504 |
+
entry = LeaderboardEntry(
|
| 505 |
+
model_name=model_name,
|
| 506 |
+
factual_recall=scores["factual_recall"],
|
| 507 |
+
socratic_dialogue=scores["socratic_dialogue"],
|
| 508 |
+
misconception_trap=scores["misconception_trap"],
|
| 509 |
+
overall=overall,
|
| 510 |
+
)
|
| 511 |
+
entry.timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
|
| 512 |
+
data = load_leaderboard()
|
| 513 |
+
existing = [e for e in data["entries"] if e["model_name"] == model_name]
|
| 514 |
+
if existing:
|
| 515 |
+
for e in data["entries"]:
|
| 516 |
+
if e["model_name"] == entry.model_name:
|
| 517 |
+
e.update(entry.model_dump())
|
| 518 |
+
else:
|
| 519 |
+
data["entries"].append(entry.model_dump())
|
| 520 |
+
save_leaderboard(data)
|
| 521 |
+
|
| 522 |
+
return {
|
| 523 |
+
"success": True,
|
| 524 |
+
"model_name": model_name,
|
| 525 |
+
"scores": scores,
|
| 526 |
+
"overall": overall,
|
| 527 |
+
}
|
| 528 |
+
|
| 529 |
+
except Exception as e:
|
| 530 |
+
return {"error": str(e)}
|
| 531 |
+
|
| 532 |
+
# ββ Adaptive Task Generator βββββββββββββββββββββββββββββββ
|
| 533 |
+
|
| 534 |
+
class GenerateTaskRequest(BaseModel):
|
| 535 |
+
topic: str
|
| 536 |
+
difficulty: str = "medium" # easy, medium, hard
|
| 537 |
+
|
| 538 |
+
@app.post("/generate_task")
|
| 539 |
+
async def generate_task(req: GenerateTaskRequest):
|
| 540 |
+
"""
|
| 541 |
+
Use an LLM to generate a brand new Socratic task on any topic.
|
| 542 |
+
Makes the environment infinitely replayable.
|
| 543 |
+
"""
|
| 544 |
+
api_base = os.getenv("API_BASE_URL", "").strip()
|
| 545 |
+
hf_token = os.getenv("HF_TOKEN", "").strip()
|
| 546 |
+
model = os.getenv("MODEL_NAME", "").strip()
|
| 547 |
+
|
| 548 |
+
if not hf_token or not api_base or not model:
|
| 549 |
+
return {"error": "API credentials not configured."}
|
| 550 |
+
|
| 551 |
+
difficulty_instructions = {
|
| 552 |
+
"easy": (
|
| 553 |
+
"Generate a simple factual question about the topic. "
|
| 554 |
+
"Then generate 2 follow-up questions that go slightly deeper. "
|
| 555 |
+
"Finally generate a common misconception about this topic as a statement."
|
| 556 |
+
),
|
| 557 |
+
"medium": (
|
| 558 |
+
"Generate an open-ended philosophical or analytical question about the topic "
|
| 559 |
+
"that requires reasoning, not just facts. "
|
| 560 |
+
"Then generate 4 probing follow-up questions that challenge the student's thinking."
|
| 561 |
+
),
|
| 562 |
+
"hard": (
|
| 563 |
+
"Generate an overview question about the topic. "
|
| 564 |
+
"Then generate a confident but FALSE statement about the topic "
|
| 565 |
+
"that sounds plausible but is actually wrong. "
|
| 566 |
+
"This will be used to test if an AI can detect the misconception."
|
| 567 |
+
),
|
| 568 |
+
}
|
| 569 |
+
|
| 570 |
+
prompt = f"""You are designing a Socratic tutoring session about: "{req.topic}"
|
| 571 |
+
|
| 572 |
+
{difficulty_instructions[req.difficulty]}
|
| 573 |
+
|
| 574 |
+
Respond ONLY with valid JSON in exactly this format, no other text:
|
| 575 |
+
|
| 576 |
+
For easy difficulty:
|
| 577 |
+
{{
|
| 578 |
+
"concept": "{req.topic}",
|
| 579 |
+
"opening": "your opening question here",
|
| 580 |
+
"follow_up": "your follow-up question here",
|
| 581 |
+
"common_misconception": "your misconception statement here",
|
| 582 |
+
"key_terms": ["term1", "term2", "term3", "term4"]
|
| 583 |
+
}}
|
| 584 |
+
|
| 585 |
+
For medium difficulty:
|
| 586 |
+
{{
|
| 587 |
+
"topic": "{req.topic}",
|
| 588 |
+
"turns": [
|
| 589 |
+
"question 1",
|
| 590 |
+
"question 2",
|
| 591 |
+
"question 3",
|
| 592 |
+
"question 4",
|
| 593 |
+
"question 5"
|
| 594 |
+
]
|
| 595 |
+
}}
|
| 596 |
+
|
| 597 |
+
For hard difficulty:
|
| 598 |
+
{{
|
| 599 |
+
"subject": "{req.topic}",
|
| 600 |
+
"setup": "your overview question here",
|
| 601 |
+
"trap_statement": "your false statement here",
|
| 602 |
+
"correct_response_keywords": ["keyword1", "keyword2", "keyword3"],
|
| 603 |
+
"explanation": "explanation of why the statement is false",
|
| 604 |
+
"follow_up_after_correction": "your follow-up question after correction"
|
| 605 |
+
}}
|
| 606 |
+
|
| 607 |
+
Generate for {req.difficulty} difficulty now:"""
|
| 608 |
+
|
| 609 |
+
try:
|
| 610 |
+
client = OpenAI(base_url=api_base, api_key=hf_token)
|
| 611 |
+
completion = client.chat.completions.create(
|
| 612 |
+
model=model,
|
| 613 |
+
messages=[
|
| 614 |
+
{
|
| 615 |
+
"role": "system",
|
| 616 |
+
"content": "You are a JSON generator. Output only valid JSON, no markdown, no explanation."
|
| 617 |
+
},
|
| 618 |
+
{"role": "user", "content": prompt}
|
| 619 |
+
],
|
| 620 |
+
max_tokens=600,
|
| 621 |
+
temperature=0.7,
|
| 622 |
+
)
|
| 623 |
+
|
| 624 |
+
raw = completion.choices[0].message.content.strip()
|
| 625 |
+
|
| 626 |
+
# Clean up markdown code blocks if model adds them
|
| 627 |
+
raw = raw.replace("```json", "").replace("```", "").strip()
|
| 628 |
+
|
| 629 |
+
task_data = json.loads(raw)
|
| 630 |
+
task_data["_generated"] = True
|
| 631 |
+
task_data["_topic"] = req.topic
|
| 632 |
+
task_data["_difficulty"] = req.difficulty
|
| 633 |
+
|
| 634 |
+
# Inject into environment's question banks
|
| 635 |
+
if req.difficulty == "easy":
|
| 636 |
+
from environment import FACTUAL_TOPICS
|
| 637 |
+
# Ensure required fields exist
|
| 638 |
+
if "key_terms" not in task_data:
|
| 639 |
+
task_data["key_terms"] = [req.topic]
|
| 640 |
+
FACTUAL_TOPICS.insert(0, task_data)
|
| 641 |
+
return {
|
| 642 |
+
"success": True,
|
| 643 |
+
"task_id": "factual_recall",
|
| 644 |
+
"difficulty": "easy",
|
| 645 |
+
"topic": req.topic,
|
| 646 |
+
"preview": task_data.get("opening", ""),
|
| 647 |
+
"message": f"Generated new easy task about '{req.topic}'. Start a factual_recall episode to use it.",
|
| 648 |
+
}
|
| 649 |
+
|
| 650 |
+
elif req.difficulty == "medium":
|
| 651 |
+
from environment import SOCRATIC_DIALOGUES
|
| 652 |
+
SOCRATIC_DIALOGUES.insert(0, task_data)
|
| 653 |
+
return {
|
| 654 |
+
"success": True,
|
| 655 |
+
"task_id": "socratic_dialogue",
|
| 656 |
+
"difficulty": "medium",
|
| 657 |
+
"topic": req.topic,
|
| 658 |
+
"preview": task_data.get("turns", [""])[0],
|
| 659 |
+
"message": f"Generated new medium task about '{req.topic}'. Start a socratic_dialogue episode to use it.",
|
| 660 |
+
}
|
| 661 |
+
|
| 662 |
+
elif req.difficulty == "hard":
|
| 663 |
+
from environment import MISCONCEPTION_TRAPS
|
| 664 |
+
if "correct_response_keywords" not in task_data:
|
| 665 |
+
task_data["correct_response_keywords"] = ["wrong", "incorrect", "false"]
|
| 666 |
+
MISCONCEPTION_TRAPS.insert(0, task_data)
|
| 667 |
+
return {
|
| 668 |
+
"success": True,
|
| 669 |
+
"task_id": "misconception_trap",
|
| 670 |
+
"difficulty": "hard",
|
| 671 |
+
"topic": req.topic,
|
| 672 |
+
"preview": task_data.get("setup", ""),
|
| 673 |
+
"message": f"Generated new hard task about '{req.topic}'. Start a misconception_trap episode to use it.",
|
| 674 |
+
}
|
| 675 |
+
|
| 676 |
+
except json.JSONDecodeError as e:
|
| 677 |
+
return {"error": f"LLM returned invalid JSON: {str(e)}", "raw": raw}
|
| 678 |
+
except Exception as e:
|
| 679 |
+
return {"error": str(e)}
|
| 680 |
+
|
| 681 |
+
# ββ Entry Point βββββββββββββββββββββββββββββββββββββββββββ
|
| 682 |
+
|
| 683 |
+
if __name__ == "__main__":
|
| 684 |
+
uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=False)
|
openenv.yaml
ADDED
|
@@ -0,0 +1,47 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: SocraticEnv
|
| 2 |
+
version: "1.0.0"
|
| 3 |
+
description: >
|
| 4 |
+
A Socratic teaching environment where an AI agent plays the role
|
| 5 |
+
of a student. The environment acts as a tutor that asks probing
|
| 6 |
+
questions, plants misconceptions, and evaluates reasoning quality.
|
| 7 |
+
Tests factual recall, multi-turn coherence, and critical thinking.
|
| 8 |
+
author: Amar Prakash
|
| 9 |
+
tags:
|
| 10 |
+
- openenv
|
| 11 |
+
- education
|
| 12 |
+
- reasoning
|
| 13 |
+
- socratic
|
| 14 |
+
- llm-evaluation
|
| 15 |
+
observation_space:
|
| 16 |
+
type: text
|
| 17 |
+
description: A question or statement from the Socratic tutor
|
| 18 |
+
action_space:
|
| 19 |
+
type: text
|
| 20 |
+
description: The agent's response to the tutor's question
|
| 21 |
+
reward_range: [0.0, 1.0]
|
| 22 |
+
tasks:
|
| 23 |
+
- id: factual_recall
|
| 24 |
+
name: Factual Recall
|
| 25 |
+
difficulty: easy
|
| 26 |
+
description: Agent must explain a concept clearly and accurately
|
| 27 |
+
- id: socratic_dialogue
|
| 28 |
+
name: Socratic Dialogue
|
| 29 |
+
difficulty: medium
|
| 30 |
+
description: Agent must stay coherent across a 5-turn Socratic dialogue
|
| 31 |
+
- id: misconception_trap
|
| 32 |
+
name: Misconception Trap
|
| 33 |
+
difficulty: hard
|
| 34 |
+
description: Agent must detect and correct a false belief planted by the tutor
|
| 35 |
+
- id: debate_mode
|
| 36 |
+
name: Debate Mode
|
| 37 |
+
difficulty: medium
|
| 38 |
+
description: Agent must argue both sides of a controversial topic
|
| 39 |
+
- id: analogy_challenge
|
| 40 |
+
name: Analogy Challenge
|
| 41 |
+
difficulty: hard
|
| 42 |
+
description: Agent must explain concepts using only everyday analogies
|
| 43 |
+
endpoints:
|
| 44 |
+
reset: POST /reset
|
| 45 |
+
step: POST /step
|
| 46 |
+
state: GET /state
|
| 47 |
+
tasks: GET /tasks
|
requirements.txt
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
fastapi==0.109.0
|
| 2 |
+
uvicorn==0.27.0
|
| 3 |
+
pydantic==2.5.3
|
| 4 |
+
openai==1.12.0
|
| 5 |
+
python-dotenv==1.0.0
|
| 6 |
+
requests==2.31.0
|
| 7 |
+
pytest==7.4.4
|
| 8 |
+
httpx==0.26.0
|
static/index.html
ADDED
|
@@ -0,0 +1,850 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="UTF-8" />
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
|
| 6 |
+
<title>SocraticEnv β Live Dashboard</title>
|
| 7 |
+
<style>
|
| 8 |
+
* { margin: 0; padding: 0; box-sizing: border-box; }
|
| 9 |
+
body {
|
| 10 |
+
font-family: 'Segoe UI', system-ui, sans-serif;
|
| 11 |
+
background: #0d1117; color: #e6edf3; min-height: 100vh;
|
| 12 |
+
}
|
| 13 |
+
.header {
|
| 14 |
+
background: #161b22; border-bottom: 1px solid #30363d;
|
| 15 |
+
padding: 16px 32px; display: flex; align-items: center;
|
| 16 |
+
justify-content: space-between;
|
| 17 |
+
}
|
| 18 |
+
.header-left { display: flex; align-items: center; gap: 12px; }
|
| 19 |
+
.logo {
|
| 20 |
+
width: 36px; height: 36px;
|
| 21 |
+
background: linear-gradient(135deg, #7c3aed, #a855f7);
|
| 22 |
+
border-radius: 8px; display: flex; align-items: center;
|
| 23 |
+
justify-content: center; font-size: 18px;
|
| 24 |
+
}
|
| 25 |
+
.header h1 { font-size: 18px; font-weight: 600; color: #e6edf3; }
|
| 26 |
+
.header p { font-size: 12px; color: #8b949e; margin-top: 2px; }
|
| 27 |
+
.header-right { display: flex; align-items: center; gap: 10px; }
|
| 28 |
+
.nav-link {
|
| 29 |
+
padding: 6px 14px; border-radius: 8px; font-size: 12px;
|
| 30 |
+
font-weight: 600; text-decoration: none; border: 1px solid #30363d;
|
| 31 |
+
color: #8b949e; background: #21262d; transition: all 0.2s;
|
| 32 |
+
}
|
| 33 |
+
.nav-link:hover { color: #e6edf3; border-color: #7c3aed; }
|
| 34 |
+
.nav-link.active { color: #a855f7; border-color: #7c3aed; background: #13111e; }
|
| 35 |
+
.status-badge {
|
| 36 |
+
display: flex; align-items: center; gap: 6px;
|
| 37 |
+
background: #1a2332; border: 1px solid #30363d;
|
| 38 |
+
border-radius: 20px; padding: 6px 14px;
|
| 39 |
+
font-size: 12px; color: #8b949e;
|
| 40 |
+
}
|
| 41 |
+
.status-dot {
|
| 42 |
+
width: 8px; height: 8px; border-radius: 50%;
|
| 43 |
+
background: #3fb950; box-shadow: 0 0 6px #3fb950;
|
| 44 |
+
animation: pulse 2s infinite;
|
| 45 |
+
}
|
| 46 |
+
.status-dot.offline { background: #f85149; box-shadow: 0 0 6px #f85149; animation: none; }
|
| 47 |
+
@keyframes pulse { 0%,100%{opacity:1} 50%{opacity:0.5} }
|
| 48 |
+
.container {
|
| 49 |
+
display: grid; grid-template-columns: 300px 1fr;
|
| 50 |
+
height: calc(100vh - 69px);
|
| 51 |
+
}
|
| 52 |
+
.sidebar {
|
| 53 |
+
background: #161b22; border-right: 1px solid #30363d;
|
| 54 |
+
padding: 20px; overflow-y: auto;
|
| 55 |
+
}
|
| 56 |
+
.sidebar-section { margin-bottom: 24px; }
|
| 57 |
+
.sidebar-title {
|
| 58 |
+
font-size: 11px; font-weight: 600; color: #8b949e;
|
| 59 |
+
letter-spacing: 1px; text-transform: uppercase; margin-bottom: 12px;
|
| 60 |
+
}
|
| 61 |
+
.task-card {
|
| 62 |
+
background: #0d1117; border: 1px solid #30363d;
|
| 63 |
+
border-radius: 10px; padding: 14px; margin-bottom: 8px;
|
| 64 |
+
cursor: pointer; transition: all 0.2s;
|
| 65 |
+
}
|
| 66 |
+
.task-card:hover { border-color: #7c3aed; background: #13111e; }
|
| 67 |
+
.task-card.active {
|
| 68 |
+
border-color: #7c3aed; background: #13111e;
|
| 69 |
+
box-shadow: 0 0 0 1px #7c3aed22;
|
| 70 |
+
}
|
| 71 |
+
.task-header {
|
| 72 |
+
display: flex; align-items: center;
|
| 73 |
+
justify-content: space-between; margin-bottom: 6px;
|
| 74 |
+
}
|
| 75 |
+
.task-name { font-size: 13px; font-weight: 600; color: #e6edf3; }
|
| 76 |
+
.difficulty {
|
| 77 |
+
font-size: 10px; font-weight: 600; padding: 2px 8px;
|
| 78 |
+
border-radius: 10px; text-transform: uppercase; letter-spacing: 0.5px;
|
| 79 |
+
}
|
| 80 |
+
.easy { background: #1a3a2a; color: #3fb950; border: 1px solid #3fb95040; }
|
| 81 |
+
.medium { background: #332d1a; color: #d29922; border: 1px solid #d2992240; }
|
| 82 |
+
.hard { background: #3a1a1a; color: #f85149; border: 1px solid #f8514940; }
|
| 83 |
+
.task-desc { font-size: 11px; color: #8b949e; line-height: 1.5; }
|
| 84 |
+
.score-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 8px; }
|
| 85 |
+
.score-card {
|
| 86 |
+
background: #0d1117; border: 1px solid #30363d;
|
| 87 |
+
border-radius: 8px; padding: 12px; text-align: center;
|
| 88 |
+
}
|
| 89 |
+
.score-value { font-size: 22px; font-weight: 700; color: #7c3aed; }
|
| 90 |
+
.score-label { font-size: 10px; color: #8b949e; margin-top: 2px; }
|
| 91 |
+
.score-card.full { grid-column: 1 / 3; }
|
| 92 |
+
.score-card.full .score-value { font-size: 28px; color: #a855f7; }
|
| 93 |
+
.turn-track { display: flex; gap: 4px; margin-top: 4px; }
|
| 94 |
+
.turn-dot {
|
| 95 |
+
flex: 1; height: 4px; border-radius: 2px;
|
| 96 |
+
background: #30363d; transition: background 0.3s;
|
| 97 |
+
}
|
| 98 |
+
.turn-dot.done { background: #7c3aed; }
|
| 99 |
+
.turn-dot.current { background: #a855f7; animation: pulse 1s infinite; }
|
| 100 |
+
.main { display: flex; flex-direction: column; overflow: hidden; }
|
| 101 |
+
.controls {
|
| 102 |
+
background: #161b22; border-bottom: 1px solid #30363d;
|
| 103 |
+
padding: 14px 24px; display: flex; align-items: center; gap: 12px;
|
| 104 |
+
}
|
| 105 |
+
.btn {
|
| 106 |
+
padding: 8px 18px; border-radius: 8px; font-size: 13px;
|
| 107 |
+
font-weight: 600; border: none; cursor: pointer;
|
| 108 |
+
transition: all 0.2s; display: flex; align-items: center; gap: 6px;
|
| 109 |
+
}
|
| 110 |
+
.btn-primary { background: #7c3aed; color: white; }
|
| 111 |
+
.btn-primary:hover { background: #6d28d9; }
|
| 112 |
+
.btn-primary:disabled { background: #3d2070; color: #8b6bb5; cursor: not-allowed; }
|
| 113 |
+
.btn-secondary { background: #21262d; color: #e6edf3; border: 1px solid #30363d; }
|
| 114 |
+
.btn-secondary:hover { background: #30363d; }
|
| 115 |
+
.btn-danger { background: #3a1a1a; color: #f85149; border: 1px solid #f8514940; }
|
| 116 |
+
.btn-danger:hover { background: #f8514920; }
|
| 117 |
+
.controls-right { margin-left: auto; display: flex; align-items: center; gap: 10px; }
|
| 118 |
+
.speed-label { font-size: 12px; color: #8b949e; }
|
| 119 |
+
.speed-select {
|
| 120 |
+
background: #21262d; border: 1px solid #30363d;
|
| 121 |
+
color: #e6edf3; border-radius: 6px; padding: 5px 10px; font-size: 12px;
|
| 122 |
+
}
|
| 123 |
+
.dialogue-area {
|
| 124 |
+
flex: 1; overflow-y: auto; padding: 24px;
|
| 125 |
+
display: flex; flex-direction: column; gap: 16px;
|
| 126 |
+
}
|
| 127 |
+
.empty-state {
|
| 128 |
+
flex: 1; display: flex; flex-direction: column;
|
| 129 |
+
align-items: center; justify-content: center;
|
| 130 |
+
gap: 12px; color: #8b949e; margin: auto;
|
| 131 |
+
}
|
| 132 |
+
.empty-icon { font-size: 48px; opacity: 0.4; }
|
| 133 |
+
.empty-title { font-size: 16px; font-weight: 600; color: #8b949e; }
|
| 134 |
+
.empty-sub { font-size: 13px; }
|
| 135 |
+
.message {
|
| 136 |
+
display: flex; gap: 12px; max-width: 85%;
|
| 137 |
+
animation: fadeUp 0.3s ease;
|
| 138 |
+
}
|
| 139 |
+
@keyframes fadeUp {
|
| 140 |
+
from { opacity:0; transform: translateY(8px); }
|
| 141 |
+
to { opacity:1; transform: translateY(0); }
|
| 142 |
+
}
|
| 143 |
+
.message.tutor { align-self: flex-start; }
|
| 144 |
+
.message.agent { align-self: flex-end; flex-direction: row-reverse; }
|
| 145 |
+
.avatar {
|
| 146 |
+
width: 36px; height: 36px; border-radius: 50%;
|
| 147 |
+
display: flex; align-items: center; justify-content: center;
|
| 148 |
+
font-size: 16px; flex-shrink: 0; margin-top: 2px;
|
| 149 |
+
}
|
| 150 |
+
.tutor .avatar { background: linear-gradient(135deg, #7c3aed, #a855f7); }
|
| 151 |
+
.agent .avatar { background: linear-gradient(135deg, #0d9488, #14b8a6); }
|
| 152 |
+
.bubble {
|
| 153 |
+
padding: 12px 16px; border-radius: 12px;
|
| 154 |
+
font-size: 14px; line-height: 1.6; max-width: 100%;
|
| 155 |
+
}
|
| 156 |
+
.tutor .bubble {
|
| 157 |
+
background: #161b22; border: 1px solid #30363d;
|
| 158 |
+
border-top-left-radius: 4px; color: #e6edf3;
|
| 159 |
+
}
|
| 160 |
+
.agent .bubble {
|
| 161 |
+
background: #13111e; border: 1px solid #7c3aed40;
|
| 162 |
+
border-top-right-radius: 4px; color: #e6edf3;
|
| 163 |
+
}
|
| 164 |
+
.bubble-meta {
|
| 165 |
+
font-size: 11px; color: #8b949e; margin-top: 6px;
|
| 166 |
+
display: flex; align-items: center; gap: 8px;
|
| 167 |
+
}
|
| 168 |
+
.agent .bubble-meta { justify-content: flex-end; }
|
| 169 |
+
.reward-pill {
|
| 170 |
+
display: inline-flex; align-items: center; gap: 4px;
|
| 171 |
+
padding: 2px 8px; border-radius: 10px;
|
| 172 |
+
font-size: 11px; font-weight: 600;
|
| 173 |
+
}
|
| 174 |
+
.reward-high { background: #1a3a2a; color: #3fb950; }
|
| 175 |
+
.reward-mid { background: #332d1a; color: #d29922; }
|
| 176 |
+
.reward-low { background: #3a1a1a; color: #f85149; }
|
| 177 |
+
.breakdown { display: flex; flex-wrap: wrap; gap: 4px; margin-top: 6px; }
|
| 178 |
+
.breakdown-item {
|
| 179 |
+
font-size: 10px; padding: 2px 7px; border-radius: 6px;
|
| 180 |
+
background: #21262d; border: 1px solid #30363d; color: #8b949e;
|
| 181 |
+
}
|
| 182 |
+
.typing { display: flex; gap: 12px; align-self: flex-start; }
|
| 183 |
+
.typing .avatar { background: linear-gradient(135deg, #0d9488, #14b8a6); }
|
| 184 |
+
.typing-dots {
|
| 185 |
+
background: #161b22; border: 1px solid #30363d;
|
| 186 |
+
border-radius: 12px; border-top-left-radius: 4px;
|
| 187 |
+
padding: 12px 16px; display: flex; gap: 4px; align-items: center;
|
| 188 |
+
}
|
| 189 |
+
.dot {
|
| 190 |
+
width: 6px; height: 6px; border-radius: 50%;
|
| 191 |
+
background: #8b949e; animation: bounce 1.2s infinite;
|
| 192 |
+
}
|
| 193 |
+
.dot:nth-child(2) { animation-delay: 0.2s; }
|
| 194 |
+
.dot:nth-child(3) { animation-delay: 0.4s; }
|
| 195 |
+
@keyframes bounce {
|
| 196 |
+
0%,60%,100%{transform:translateY(0)} 30%{transform:translateY(-6px)}
|
| 197 |
+
}
|
| 198 |
+
.input-area {
|
| 199 |
+
background: #161b22; border-top: 1px solid #30363d; padding: 16px 24px;
|
| 200 |
+
}
|
| 201 |
+
.input-row { display: flex; gap: 10px; }
|
| 202 |
+
.input-box {
|
| 203 |
+
flex: 1; background: #0d1117; border: 1px solid #30363d;
|
| 204 |
+
border-radius: 10px; padding: 10px 16px; color: #e6edf3;
|
| 205 |
+
font-size: 14px; font-family: inherit; resize: none;
|
| 206 |
+
transition: border 0.2s; min-height: 44px; max-height: 120px;
|
| 207 |
+
}
|
| 208 |
+
.input-box:focus { outline: none; border-color: #7c3aed; }
|
| 209 |
+
.input-box::placeholder { color: #484f58; }
|
| 210 |
+
.btn-send {
|
| 211 |
+
background: #7c3aed; border: none; border-radius: 10px;
|
| 212 |
+
color: white; padding: 10px 18px; cursor: pointer;
|
| 213 |
+
font-size: 18px; transition: background 0.2s; align-self: flex-end;
|
| 214 |
+
}
|
| 215 |
+
.btn-send:hover { background: #6d28d9; }
|
| 216 |
+
.btn-send:disabled { background: #3d2070; cursor: not-allowed; }
|
| 217 |
+
.input-hint {
|
| 218 |
+
font-size: 11px; color: #484f58; margin-top: 6px;
|
| 219 |
+
display: flex; justify-content: space-between;
|
| 220 |
+
}
|
| 221 |
+
.autorun-banner {
|
| 222 |
+
background: #13111e; border: 1px solid #7c3aed40;
|
| 223 |
+
border-radius: 8px; padding: 8px 14px; font-size: 12px;
|
| 224 |
+
color: #a855f7; display: none; align-items: center;
|
| 225 |
+
gap: 8px; margin-bottom: 10px;
|
| 226 |
+
}
|
| 227 |
+
.autorun-banner.visible { display: flex; }
|
| 228 |
+
.complete-banner {
|
| 229 |
+
background: #1a3a2a; border: 1px solid #3fb95040;
|
| 230 |
+
border-radius: 10px; padding: 16px 20px;
|
| 231 |
+
display: flex; align-items: center;
|
| 232 |
+
justify-content: space-between; animation: fadeUp 0.3s ease;
|
| 233 |
+
}
|
| 234 |
+
.complete-left { display: flex; align-items: center; gap: 12px; }
|
| 235 |
+
.complete-icon { font-size: 24px; }
|
| 236 |
+
.complete-title { font-size: 14px; font-weight: 600; color: #3fb950; }
|
| 237 |
+
.complete-sub { font-size: 12px; color: #8b949e; margin-top: 2px; }
|
| 238 |
+
.final-score { font-size: 28px; font-weight: 700; color: #3fb950; }
|
| 239 |
+
.system-msg {
|
| 240 |
+
text-align: center; font-size: 12px; color: #8b949e;
|
| 241 |
+
padding: 8px 16px; background: #161b22;
|
| 242 |
+
border: 1px solid #30363d; border-radius: 8px;
|
| 243 |
+
align-self: center;
|
| 244 |
+
}
|
| 245 |
+
.system-msg.error { color: #f85149; border-color: #f8514940; background: #3a1a1a; }
|
| 246 |
+
.system-msg.warning { color: #d29922; border-color: #d2992240; background: #332d1a; }
|
| 247 |
+
::-webkit-scrollbar { width: 4px; }
|
| 248 |
+
::-webkit-scrollbar-track { background: transparent; }
|
| 249 |
+
::-webkit-scrollbar-thumb { background: #30363d; border-radius: 2px; }
|
| 250 |
+
</style>
|
| 251 |
+
</head>
|
| 252 |
+
<body>
|
| 253 |
+
|
| 254 |
+
<div class="header">
|
| 255 |
+
<div class="header-left">
|
| 256 |
+
<div class="logo">π</div>
|
| 257 |
+
<div>
|
| 258 |
+
<h1>SocraticEnv</h1>
|
| 259 |
+
<p>OpenEnv Hackathon Β· Meta Γ PyTorch Γ Scaler</p>
|
| 260 |
+
</div>
|
| 261 |
+
</div>
|
| 262 |
+
<div class="header-right">
|
| 263 |
+
<a href="/ui/index.html" class="nav-link active">Live Demo</a>
|
| 264 |
+
<a href="/ui/leaderboard.html" class="nav-link">π Leaderboard</a>
|
| 265 |
+
<a href="/docs" class="nav-link">API Docs</a>
|
| 266 |
+
<div class="status-badge">
|
| 267 |
+
<div class="status-dot" id="statusDot"></div>
|
| 268 |
+
<span id="statusText">Connecting...</span>
|
| 269 |
+
</div>
|
| 270 |
+
</div>
|
| 271 |
+
</div>
|
| 272 |
+
|
| 273 |
+
<div class="container">
|
| 274 |
+
<div class="sidebar">
|
| 275 |
+
<div class="sidebar-section">
|
| 276 |
+
<div class="sidebar-title">Choose a Task</div>
|
| 277 |
+
<div class="task-card active" onclick="selectTask('factual_recall')" id="card-factual_recall">
|
| 278 |
+
<div class="task-header">
|
| 279 |
+
<span class="task-name">Factual Recall</span>
|
| 280 |
+
<span class="difficulty easy">Easy</span>
|
| 281 |
+
</div>
|
| 282 |
+
<div class="task-desc">Agent explains a concept. Graded on accuracy, key terms, and rejecting misconceptions.</div>
|
| 283 |
+
</div>
|
| 284 |
+
<div class="task-card" onclick="selectTask('socratic_dialogue')" id="card-socratic_dialogue">
|
| 285 |
+
<div class="task-header">
|
| 286 |
+
<span class="task-name">Socratic Dialogue</span>
|
| 287 |
+
<span class="difficulty medium">Medium</span>
|
| 288 |
+
</div>
|
| 289 |
+
<div class="task-desc">5-turn philosophical dialogue. Graded on reasoning depth and coherence.</div>
|
| 290 |
+
</div>
|
| 291 |
+
<div class="task-card" onclick="selectTask('misconception_trap')" id="card-misconception_trap">
|
| 292 |
+
<div class="task-header">
|
| 293 |
+
<span class="task-name">Misconception Trap</span>
|
| 294 |
+
<span class="difficulty hard">Hard</span>
|
| 295 |
+
</div>
|
| 296 |
+
<div class="task-desc">Tutor plants a false belief. Agent must detect, correct, and explain.</div>
|
| 297 |
+
</div>
|
| 298 |
+
<div class="task-card" onclick="selectTask('debate_mode')" id="card-debate_mode">
|
| 299 |
+
<div class="task-header">
|
| 300 |
+
<span class="task-name">Debate Mode</span>
|
| 301 |
+
<span class="difficulty medium">Medium</span>
|
| 302 |
+
</div>
|
| 303 |
+
<div class="task-desc">Agent argues both sides of a topic. Graded on argument quality and use of evidence.</div>
|
| 304 |
+
</div>
|
| 305 |
+
<div class="task-card" onclick="selectTask('analogy_challenge')" id="card-analogy_challenge">
|
| 306 |
+
<div class="task-header">
|
| 307 |
+
<span class="task-name">Analogy Challenge</span>
|
| 308 |
+
<span class="difficulty hard">Hard</span>
|
| 309 |
+
</div>
|
| 310 |
+
<div class="task-desc">Explain complex concepts using ONLY analogies. No technical jargon allowed!</div>
|
| 311 |
+
</div>
|
| 312 |
+
</div>
|
| 313 |
+
<div class="sidebar-section">
|
| 314 |
+
<div class="sidebar-title">Generate Custom Task</div>
|
| 315 |
+
<div style="margin-bottom:8px;">
|
| 316 |
+
<input
|
| 317 |
+
id="topicInput"
|
| 318 |
+
placeholder="Any topic... e.g. Black holes"
|
| 319 |
+
style="width:100%;background:#0d1117;border:1px solid #30363d;border-radius:8px;padding:8px 10px;color:#e6edf3;font-size:12px;font-family:inherit;outline:none;"
|
| 320 |
+
onkeydown="if(event.key==='Enter') generateTask()"
|
| 321 |
+
/>
|
| 322 |
+
</div>
|
| 323 |
+
<div style="display:flex;gap:6px;margin-bottom:8px;">
|
| 324 |
+
<select id="genDifficulty" style="flex:1;background:#21262d;border:1px solid #30363d;color:#e6edf3;border-radius:6px;padding:5px 8px;font-size:11px;">
|
| 325 |
+
<option value="easy">Easy</option>
|
| 326 |
+
<option value="medium" selected>Medium</option>
|
| 327 |
+
<option value="hard">Hard</option>
|
| 328 |
+
</select>
|
| 329 |
+
<button
|
| 330 |
+
onclick="generateTask()"
|
| 331 |
+
id="generateBtn"
|
| 332 |
+
style="flex:2;background:#7c3aed;color:white;border:none;border-radius:6px;padding:5px 10px;font-size:11px;font-weight:600;cursor:pointer;">
|
| 333 |
+
β¨ Generate
|
| 334 |
+
</button>
|
| 335 |
+
</div>
|
| 336 |
+
<div id="generateStatus" style="font-size:11px;color:#8b949e;min-height:16px;line-height:1.4;"></div>
|
| 337 |
+
</div>
|
| 338 |
+
<div class="sidebar-section">
|
| 339 |
+
<div class="sidebar-title">Live Scores</div>
|
| 340 |
+
<div class="score-grid">
|
| 341 |
+
<div class="score-card full">
|
| 342 |
+
<div class="score-value" id="overallScore">β</div>
|
| 343 |
+
<div class="score-label">Overall Score</div>
|
| 344 |
+
</div>
|
| 345 |
+
<div class="score-card">
|
| 346 |
+
<div class="score-value" id="turnCount" style="color:#d29922">0</div>
|
| 347 |
+
<div class="score-label">Turns</div>
|
| 348 |
+
</div>
|
| 349 |
+
<div class="score-card">
|
| 350 |
+
<div class="score-value" id="lastReward" style="color:#3fb950">β</div>
|
| 351 |
+
<div class="score-label">Last Reward</div>
|
| 352 |
+
</div>
|
| 353 |
+
</div>
|
| 354 |
+
</div>
|
| 355 |
+
|
| 356 |
+
<div class="sidebar-section">
|
| 357 |
+
<div class="sidebar-title">Turn Progress</div>
|
| 358 |
+
<div class="turn-track" id="turnTrack"></div>
|
| 359 |
+
<div style="font-size:11px;color:#8b949e;margin-top:8px" id="turnLabel">No active episode</div>
|
| 360 |
+
</div>
|
| 361 |
+
|
| 362 |
+
<div class="sidebar-section">
|
| 363 |
+
<div class="sidebar-title">Session History</div>
|
| 364 |
+
<div id="sessionHistory" style="font-size:12px;color:#8b949e;">
|
| 365 |
+
No completed episodes yet.
|
| 366 |
+
</div>
|
| 367 |
+
</div>
|
| 368 |
+
</div>
|
| 369 |
+
|
| 370 |
+
<div class="main">
|
| 371 |
+
<div class="controls">
|
| 372 |
+
<button class="btn btn-primary" id="btnStart" onclick="startEpisode()">βΆ Start Episode</button>
|
| 373 |
+
<button class="btn btn-secondary" id="btnAutoRun" onclick="toggleAutoRun()">β‘ Auto-Run AI</button>
|
| 374 |
+
<button class="btn btn-danger" onclick="resetAll()">βΊ Reset</button>
|
| 375 |
+
<div class="controls-right">
|
| 376 |
+
<span class="speed-label">Speed:</span>
|
| 377 |
+
<select class="speed-select" id="speedSelect">
|
| 378 |
+
<option value="2000">Slow</option>
|
| 379 |
+
<option value="1000" selected>Normal</option>
|
| 380 |
+
<option value="400">Fast</option>
|
| 381 |
+
</select>
|
| 382 |
+
</div>
|
| 383 |
+
</div>
|
| 384 |
+
|
| 385 |
+
<div class="dialogue-area" id="dialogueArea">
|
| 386 |
+
<div class="empty-state" id="emptyState">
|
| 387 |
+
<div class="empty-icon">π</div>
|
| 388 |
+
<div class="empty-title">SocraticEnv is ready</div>
|
| 389 |
+
<div class="empty-sub">Select a task and click Start Episode</div>
|
| 390 |
+
</div>
|
| 391 |
+
</div>
|
| 392 |
+
|
| 393 |
+
<div class="input-area">
|
| 394 |
+
<div class="autorun-banner" id="autorunBanner">
|
| 395 |
+
<span>β‘</span>
|
| 396 |
+
<span id="autorunStatus">Auto-Run mode β AI is thinking...</span>
|
| 397 |
+
</div>
|
| 398 |
+
<div class="input-row">
|
| 399 |
+
<textarea
|
| 400 |
+
class="input-box" id="inputBox"
|
| 401 |
+
placeholder="Type your response as the student agent... (or use Auto-Run AI)"
|
| 402 |
+
rows="1" disabled onkeydown="handleKey(event)"
|
| 403 |
+
></textarea>
|
| 404 |
+
<button class="btn-send" id="btnSend" onclick="sendManual()" disabled>β€</button>
|
| 405 |
+
</div>
|
| 406 |
+
<div class="input-hint">
|
| 407 |
+
<span>Press Enter to send Β· Shift+Enter for new line</span>
|
| 408 |
+
<span id="taskHint">No active task</span>
|
| 409 |
+
</div>
|
| 410 |
+
</div>
|
| 411 |
+
</div>
|
| 412 |
+
</div>
|
| 413 |
+
|
| 414 |
+
<script>
|
| 415 |
+
const API = window.location.origin;
|
| 416 |
+
let selectedTask = 'factual_recall';
|
| 417 |
+
let episodeActive = false;
|
| 418 |
+
let autoRunning = false;
|
| 419 |
+
let autoRunTimer = null;
|
| 420 |
+
let totalScore = 0;
|
| 421 |
+
let turnCount = 0;
|
| 422 |
+
let maxTurns = 3;
|
| 423 |
+
let sessionResults = [];
|
| 424 |
+
let currentHistory = [];
|
| 425 |
+
|
| 426 |
+
async function checkStatus() {
|
| 427 |
+
try {
|
| 428 |
+
const r = await fetch(`${API}/ping`);
|
| 429 |
+
const dot = document.getElementById('statusDot');
|
| 430 |
+
const txt = document.getElementById('statusText');
|
| 431 |
+
if (r.ok) {
|
| 432 |
+
dot.classList.remove('offline');
|
| 433 |
+
txt.textContent = 'Environment online';
|
| 434 |
+
} else {
|
| 435 |
+
dot.classList.add('offline');
|
| 436 |
+
txt.textContent = 'Environment offline';
|
| 437 |
+
}
|
| 438 |
+
} catch {
|
| 439 |
+
document.getElementById('statusDot').classList.add('offline');
|
| 440 |
+
document.getElementById('statusText').textContent = 'Cannot connect';
|
| 441 |
+
}
|
| 442 |
+
}
|
| 443 |
+
checkStatus();
|
| 444 |
+
setInterval(checkStatus, 5000);
|
| 445 |
+
|
| 446 |
+
function selectTask(taskId) {
|
| 447 |
+
selectedTask = taskId;
|
| 448 |
+
document.querySelectorAll('.task-card').forEach(c => c.classList.remove('active'));
|
| 449 |
+
document.getElementById(`card-${taskId}`).classList.add('active');
|
| 450 |
+
const hints = {
|
| 451 |
+
factual_recall: 'Easy β Explain a concept clearly',
|
| 452 |
+
socratic_dialogue: 'Medium β Engage in 5-turn reasoning',
|
| 453 |
+
misconception_trap:'Hard β Catch the planted false belief!',
|
| 454 |
+
debate_mode: 'Medium β Argue both sides convincingly',
|
| 455 |
+
analogy_challenge: 'Hard β No jargon, analogies only!',
|
| 456 |
+
};
|
| 457 |
+
document.getElementById('taskHint').textContent = hints[taskId];
|
| 458 |
+
}
|
| 459 |
+
|
| 460 |
+
async function startEpisode() {
|
| 461 |
+
clearDialogue();
|
| 462 |
+
episodeActive = true;
|
| 463 |
+
totalScore = 0;
|
| 464 |
+
turnCount = 0;
|
| 465 |
+
currentHistory = [];
|
| 466 |
+
|
| 467 |
+
const maxMap = { factual_recall: 3, socratic_dialogue: 5, misconception_trap: 3, debate_mode: 4, analogy_challenge: 3 };
|
| 468 |
+
maxTurns = maxMap[selectedTask];
|
| 469 |
+
buildTurnTrack(maxTurns);
|
| 470 |
+
updateScores();
|
| 471 |
+
|
| 472 |
+
document.getElementById('btnStart').disabled = true;
|
| 473 |
+
document.getElementById('emptyState')?.remove();
|
| 474 |
+
|
| 475 |
+
try {
|
| 476 |
+
const r = await fetch(`${API}/reset`, {
|
| 477 |
+
method: 'POST',
|
| 478 |
+
headers: { 'Content-Type': 'application/json' },
|
| 479 |
+
body: JSON.stringify({ task_id: selectedTask }),
|
| 480 |
+
});
|
| 481 |
+
const data = await r.json();
|
| 482 |
+
const question = data.observation.question;
|
| 483 |
+
currentHistory.push({ role: 'tutor', content: question });
|
| 484 |
+
addTutorMessage(question);
|
| 485 |
+
enableInput();
|
| 486 |
+
document.getElementById('turnLabel').textContent = `Turn 1 of ${maxTurns}`;
|
| 487 |
+
} catch (e) {
|
| 488 |
+
addSystemMessage('β Could not connect to environment.', 'error');
|
| 489 |
+
document.getElementById('btnStart').disabled = false;
|
| 490 |
+
episodeActive = false;
|
| 491 |
+
}
|
| 492 |
+
}
|
| 493 |
+
|
| 494 |
+
async function sendResponse(response) {
|
| 495 |
+
if (!episodeActive || !response || !response.trim()) return;
|
| 496 |
+
|
| 497 |
+
disableInput();
|
| 498 |
+
addAgentMessage(response);
|
| 499 |
+
currentHistory.push({ role: 'agent', content: response });
|
| 500 |
+
|
| 501 |
+
showTyping();
|
| 502 |
+
await sleep(300);
|
| 503 |
+
|
| 504 |
+
try {
|
| 505 |
+
const r = await fetch(`${API}/step`, {
|
| 506 |
+
method: 'POST',
|
| 507 |
+
headers: { 'Content-Type': 'application/json' },
|
| 508 |
+
body: JSON.stringify({ response }),
|
| 509 |
+
});
|
| 510 |
+
const data = await r.json();
|
| 511 |
+
removeTyping();
|
| 512 |
+
|
| 513 |
+
turnCount++;
|
| 514 |
+
const score = data.reward.score;
|
| 515 |
+
totalScore += score;
|
| 516 |
+
updateScores(score);
|
| 517 |
+
updateTurnTrack(turnCount);
|
| 518 |
+
|
| 519 |
+
const nextQuestion = data.observation.question;
|
| 520 |
+
currentHistory.push({ role: 'tutor', content: nextQuestion });
|
| 521 |
+
addTutorMessage(nextQuestion, data.reward);
|
| 522 |
+
|
| 523 |
+
if (data.done) {
|
| 524 |
+
episodeActive = false;
|
| 525 |
+
stopAutoRun();
|
| 526 |
+
const avg = totalScore / turnCount;
|
| 527 |
+
showComplete(avg, data.reward.feedback);
|
| 528 |
+
saveToHistory(selectedTask, avg);
|
| 529 |
+
document.getElementById('btnStart').disabled = false;
|
| 530 |
+
} else {
|
| 531 |
+
if (!autoRunning) enableInput();
|
| 532 |
+
}
|
| 533 |
+
} catch (e) {
|
| 534 |
+
removeTyping();
|
| 535 |
+
addSystemMessage(`β Step error: ${e.message}`, 'error');
|
| 536 |
+
enableInput();
|
| 537 |
+
}
|
| 538 |
+
}
|
| 539 |
+
|
| 540 |
+
function sendManual() {
|
| 541 |
+
const box = document.getElementById('inputBox');
|
| 542 |
+
const val = box.value.trim();
|
| 543 |
+
if (!val) return;
|
| 544 |
+
box.value = '';
|
| 545 |
+
box.style.height = '44px';
|
| 546 |
+
sendResponse(val);
|
| 547 |
+
}
|
| 548 |
+
|
| 549 |
+
function handleKey(e) {
|
| 550 |
+
if (e.key === 'Enter' && !e.shiftKey) { e.preventDefault(); sendManual(); }
|
| 551 |
+
}
|
| 552 |
+
|
| 553 |
+
async function getAIResponse(question, history) {
|
| 554 |
+
document.getElementById('autorunStatus').textContent = 'β‘ AI is thinking...';
|
| 555 |
+
try {
|
| 556 |
+
const r = await fetch(`${API}/inference`, {
|
| 557 |
+
method: 'POST',
|
| 558 |
+
headers: { 'Content-Type': 'application/json' },
|
| 559 |
+
body: JSON.stringify({ message: question, history: history }),
|
| 560 |
+
});
|
| 561 |
+
if (!r.ok) {
|
| 562 |
+
const err = await r.text();
|
| 563 |
+
addSystemMessage(`β οΈ Inference API error ${r.status}: ${err}`, 'warning');
|
| 564 |
+
return null;
|
| 565 |
+
}
|
| 566 |
+
const data = await r.json();
|
| 567 |
+
if (data.response && data.response.startsWith('ERROR:')) {
|
| 568 |
+
addSystemMessage(`β οΈ ${data.response}`, 'warning');
|
| 569 |
+
return null;
|
| 570 |
+
}
|
| 571 |
+
document.getElementById('autorunStatus').textContent = 'β‘ Auto-Run mode β AI is responding';
|
| 572 |
+
return data.response;
|
| 573 |
+
} catch (e) {
|
| 574 |
+
addSystemMessage(`β οΈ Could not reach /inference: ${e.message}`, 'warning');
|
| 575 |
+
return null;
|
| 576 |
+
}
|
| 577 |
+
}
|
| 578 |
+
|
| 579 |
+
function toggleAutoRun() {
|
| 580 |
+
if (autoRunning) { stopAutoRun(); return; }
|
| 581 |
+
if (!episodeActive) {
|
| 582 |
+
startEpisode().then(() => {
|
| 583 |
+
if (episodeActive) { autoRunning = true; startAutoRun(); }
|
| 584 |
+
});
|
| 585 |
+
} else {
|
| 586 |
+
autoRunning = true;
|
| 587 |
+
startAutoRun();
|
| 588 |
+
}
|
| 589 |
+
}
|
| 590 |
+
|
| 591 |
+
function startAutoRun() {
|
| 592 |
+
autoRunning = true;
|
| 593 |
+
document.getElementById('autorunBanner').classList.add('visible');
|
| 594 |
+
document.getElementById('btnAutoRun').textContent = 'βΉ Stop Auto-Run';
|
| 595 |
+
disableInput();
|
| 596 |
+
runNextAutoStep();
|
| 597 |
+
}
|
| 598 |
+
|
| 599 |
+
async function runNextAutoStep() {
|
| 600 |
+
if (!autoRunning || !episodeActive) return;
|
| 601 |
+
const speed = parseInt(document.getElementById('speedSelect').value);
|
| 602 |
+
await sleep(speed);
|
| 603 |
+
if (!autoRunning || !episodeActive) return;
|
| 604 |
+
|
| 605 |
+
const tutorMessages = currentHistory.filter(h => h.role === 'tutor');
|
| 606 |
+
if (tutorMessages.length === 0) { stopAutoRun(); return; }
|
| 607 |
+
const lastQuestion = tutorMessages[tutorMessages.length - 1].content;
|
| 608 |
+
|
| 609 |
+
const response = await getAIResponse(lastQuestion, currentHistory);
|
| 610 |
+
if (!response) {
|
| 611 |
+
stopAutoRun();
|
| 612 |
+
addSystemMessage('β οΈ Auto-Run stopped. Check HuggingFace Space secrets or type manually.', 'warning');
|
| 613 |
+
if (episodeActive) enableInput();
|
| 614 |
+
return;
|
| 615 |
+
}
|
| 616 |
+
if (!autoRunning || !episodeActive) return;
|
| 617 |
+
await sendResponse(response);
|
| 618 |
+
if (episodeActive && autoRunning) runNextAutoStep();
|
| 619 |
+
}
|
| 620 |
+
|
| 621 |
+
function stopAutoRun() {
|
| 622 |
+
autoRunning = false;
|
| 623 |
+
clearTimeout(autoRunTimer);
|
| 624 |
+
document.getElementById('autorunBanner').classList.remove('visible');
|
| 625 |
+
document.getElementById('btnAutoRun').textContent = 'β‘ Auto-Run AI';
|
| 626 |
+
if (episodeActive) enableInput();
|
| 627 |
+
}
|
| 628 |
+
|
| 629 |
+
function resetAll() {
|
| 630 |
+
episodeActive = false;
|
| 631 |
+
autoRunning = false;
|
| 632 |
+
currentHistory = [];
|
| 633 |
+
clearTimeout(autoRunTimer);
|
| 634 |
+
stopAutoRun();
|
| 635 |
+
clearDialogue();
|
| 636 |
+
totalScore = 0; turnCount = 0;
|
| 637 |
+
document.getElementById('overallScore').textContent = 'β';
|
| 638 |
+
document.getElementById('turnCount').textContent = '0';
|
| 639 |
+
document.getElementById('lastReward').textContent = 'β';
|
| 640 |
+
document.getElementById('turnTrack').innerHTML = '';
|
| 641 |
+
document.getElementById('turnLabel').textContent = 'No active episode';
|
| 642 |
+
document.getElementById('btnStart').disabled = false;
|
| 643 |
+
disableInput();
|
| 644 |
+
document.getElementById('dialogueArea').innerHTML =
|
| 645 |
+
`<div class="empty-state" id="emptyState">
|
| 646 |
+
<div class="empty-icon">π</div>
|
| 647 |
+
<div class="empty-title">SocraticEnv is ready</div>
|
| 648 |
+
<div class="empty-sub">Select a task and click Start Episode</div>
|
| 649 |
+
</div>`;
|
| 650 |
+
}
|
| 651 |
+
|
| 652 |
+
function addTutorMessage(text, reward = null) {
|
| 653 |
+
const area = document.getElementById('dialogueArea');
|
| 654 |
+
const div = document.createElement('div');
|
| 655 |
+
div.className = 'message tutor';
|
| 656 |
+
let rewardHtml = '', breakdownHtml = '';
|
| 657 |
+
if (reward) {
|
| 658 |
+
const sc = reward.score;
|
| 659 |
+
const cls = sc >= 0.7 ? 'reward-high' : sc >= 0.4 ? 'reward-mid' : 'reward-low';
|
| 660 |
+
rewardHtml = `<span class="reward-pill ${cls}">+${sc.toFixed(3)}</span>`;
|
| 661 |
+
const bd = Object.entries(reward.breakdown)
|
| 662 |
+
.map(([k,v]) => `<span class="breakdown-item">${k}: ${v}</span>`).join('');
|
| 663 |
+
breakdownHtml = `<div class="breakdown">${bd}</div>`;
|
| 664 |
+
}
|
| 665 |
+
div.innerHTML = `
|
| 666 |
+
<div class="avatar">π</div>
|
| 667 |
+
<div>
|
| 668 |
+
<div class="bubble">${text}</div>
|
| 669 |
+
<div class="bubble-meta">Tutor ${rewardHtml}</div>
|
| 670 |
+
${breakdownHtml}
|
| 671 |
+
</div>`;
|
| 672 |
+
area.appendChild(div);
|
| 673 |
+
area.scrollTop = area.scrollHeight;
|
| 674 |
+
}
|
| 675 |
+
|
| 676 |
+
function addAgentMessage(text) {
|
| 677 |
+
const area = document.getElementById('dialogueArea');
|
| 678 |
+
const div = document.createElement('div');
|
| 679 |
+
div.className = 'message agent';
|
| 680 |
+
div.innerHTML = `
|
| 681 |
+
<div class="avatar">π€</div>
|
| 682 |
+
<div>
|
| 683 |
+
<div class="bubble">${text}</div>
|
| 684 |
+
<div class="bubble-meta">Agent</div>
|
| 685 |
+
</div>`;
|
| 686 |
+
area.appendChild(div);
|
| 687 |
+
area.scrollTop = area.scrollHeight;
|
| 688 |
+
}
|
| 689 |
+
|
| 690 |
+
function addSystemMessage(text, type = '') {
|
| 691 |
+
const area = document.getElementById('dialogueArea');
|
| 692 |
+
const div = document.createElement('div');
|
| 693 |
+
div.className = `system-msg ${type}`;
|
| 694 |
+
div.textContent = text;
|
| 695 |
+
area.appendChild(div);
|
| 696 |
+
area.scrollTop = area.scrollHeight;
|
| 697 |
+
}
|
| 698 |
+
|
| 699 |
+
function showTyping() {
|
| 700 |
+
const area = document.getElementById('dialogueArea');
|
| 701 |
+
const div = document.createElement('div');
|
| 702 |
+
div.className = 'typing'; div.id = 'typingIndicator';
|
| 703 |
+
div.innerHTML = `
|
| 704 |
+
<div class="avatar">π€</div>
|
| 705 |
+
<div class="typing-dots">
|
| 706 |
+
<div class="dot"></div><div class="dot"></div><div class="dot"></div>
|
| 707 |
+
</div>`;
|
| 708 |
+
area.appendChild(div);
|
| 709 |
+
area.scrollTop = area.scrollHeight;
|
| 710 |
+
}
|
| 711 |
+
|
| 712 |
+
function removeTyping() { document.getElementById('typingIndicator')?.remove(); }
|
| 713 |
+
|
| 714 |
+
function showComplete(score, feedback) {
|
| 715 |
+
const area = document.getElementById('dialogueArea');
|
| 716 |
+
const div = document.createElement('div');
|
| 717 |
+
div.innerHTML = `
|
| 718 |
+
<div class="complete-banner">
|
| 719 |
+
<div class="complete-left">
|
| 720 |
+
<div class="complete-icon">${score >= 0.7 ? 'π' : score >= 0.5 ? 'β
' : 'π'}</div>
|
| 721 |
+
<div>
|
| 722 |
+
<div class="complete-title">Episode Complete</div>
|
| 723 |
+
<div class="complete-sub">${feedback}</div>
|
| 724 |
+
</div>
|
| 725 |
+
</div>
|
| 726 |
+
<div class="final-score">${score.toFixed(3)}</div>
|
| 727 |
+
</div>`;
|
| 728 |
+
area.appendChild(div);
|
| 729 |
+
area.scrollTop = area.scrollHeight;
|
| 730 |
+
document.getElementById('overallScore').textContent = score.toFixed(3);
|
| 731 |
+
document.getElementById('overallScore').style.color =
|
| 732 |
+
score >= 0.7 ? '#3fb950' : score >= 0.5 ? '#d29922' : '#f85149';
|
| 733 |
+
}
|
| 734 |
+
|
| 735 |
+
function clearDialogue() { document.getElementById('dialogueArea').innerHTML = ''; }
|
| 736 |
+
|
| 737 |
+
function enableInput() {
|
| 738 |
+
document.getElementById('inputBox').disabled = false;
|
| 739 |
+
document.getElementById('btnSend').disabled = false;
|
| 740 |
+
document.getElementById('inputBox').focus();
|
| 741 |
+
}
|
| 742 |
+
|
| 743 |
+
function disableInput() {
|
| 744 |
+
document.getElementById('inputBox').disabled = true;
|
| 745 |
+
document.getElementById('btnSend').disabled = true;
|
| 746 |
+
}
|
| 747 |
+
|
| 748 |
+
function buildTurnTrack(n) {
|
| 749 |
+
const track = document.getElementById('turnTrack');
|
| 750 |
+
track.innerHTML = '';
|
| 751 |
+
for (let i = 0; i < n; i++) {
|
| 752 |
+
const d = document.createElement('div');
|
| 753 |
+
d.className = 'turn-dot'; d.id = `dot-${i}`;
|
| 754 |
+
track.appendChild(d);
|
| 755 |
+
}
|
| 756 |
+
}
|
| 757 |
+
|
| 758 |
+
function updateTurnTrack(turn) {
|
| 759 |
+
for (let i = 0; i < maxTurns; i++) {
|
| 760 |
+
const d = document.getElementById(`dot-${i}`);
|
| 761 |
+
if (!d) continue;
|
| 762 |
+
if (i < turn) d.className = 'turn-dot done';
|
| 763 |
+
else if (i===turn) d.className = 'turn-dot current';
|
| 764 |
+
else d.className = 'turn-dot';
|
| 765 |
+
}
|
| 766 |
+
document.getElementById('turnLabel').textContent = `Turn ${turn} of ${maxTurns}`;
|
| 767 |
+
}
|
| 768 |
+
|
| 769 |
+
function updateScores(lastReward = null) {
|
| 770 |
+
document.getElementById('turnCount').textContent = turnCount;
|
| 771 |
+
if (lastReward !== null) {
|
| 772 |
+
document.getElementById('lastReward').textContent = lastReward.toFixed(3);
|
| 773 |
+
document.getElementById('lastReward').style.color =
|
| 774 |
+
lastReward >= 0.7 ? '#3fb950' : lastReward >= 0.4 ? '#d29922' : '#f85149';
|
| 775 |
+
}
|
| 776 |
+
if (turnCount > 0) {
|
| 777 |
+
document.getElementById('overallScore').textContent =
|
| 778 |
+
(totalScore / turnCount).toFixed(3);
|
| 779 |
+
}
|
| 780 |
+
}
|
| 781 |
+
|
| 782 |
+
function saveToHistory(task, score) {
|
| 783 |
+
sessionResults.unshift({ task, score });
|
| 784 |
+
document.getElementById('sessionHistory').innerHTML =
|
| 785 |
+
sessionResults.slice(0, 5).map(r => `
|
| 786 |
+
<div style="display:flex;justify-content:space-between;padding:6px 0;border-bottom:1px solid #21262d;">
|
| 787 |
+
<span style="color:#c9d1d9">${r.task.replace(/_/g,' ')}</span>
|
| 788 |
+
<span style="color:${r.score>=0.7?'#3fb950':r.score>=0.5?'#d29922':'#f85149'};font-weight:600">
|
| 789 |
+
${r.score.toFixed(3)}
|
| 790 |
+
</span>
|
| 791 |
+
</div>`).join('');
|
| 792 |
+
}
|
| 793 |
+
|
| 794 |
+
// ββ Custom Task Generator βββββββββββββββββββββββββββββββββ
|
| 795 |
+
async function generateTask() {
|
| 796 |
+
const topic = document.getElementById('topicInput').value.trim();
|
| 797 |
+
const difficulty = document.getElementById('genDifficulty').value;
|
| 798 |
+
const status = document.getElementById('generateStatus');
|
| 799 |
+
const btn = document.getElementById('generateBtn');
|
| 800 |
+
|
| 801 |
+
if (!topic) {
|
| 802 |
+
status.textContent = 'β οΈ Please enter a topic first.';
|
| 803 |
+
status.style.color = '#d29922';
|
| 804 |
+
return;
|
| 805 |
+
}
|
| 806 |
+
|
| 807 |
+
btn.disabled = true;
|
| 808 |
+
btn.textContent = 'β³ Generating...';
|
| 809 |
+
status.style.color = '#a855f7';
|
| 810 |
+
status.textContent = `Generating ${difficulty} task about "${topic}"...`;
|
| 811 |
+
|
| 812 |
+
try {
|
| 813 |
+
const r = await fetch(`${API}/generate_task`, {
|
| 814 |
+
method: 'POST',
|
| 815 |
+
headers: { 'Content-Type': 'application/json' },
|
| 816 |
+
body: JSON.stringify({ topic, difficulty }),
|
| 817 |
+
});
|
| 818 |
+
const data = await r.json();
|
| 819 |
+
|
| 820 |
+
if (data.error) {
|
| 821 |
+
status.style.color = '#f85149';
|
| 822 |
+
status.textContent = `β ${data.error}`;
|
| 823 |
+
} else {
|
| 824 |
+
status.style.color = '#3fb950';
|
| 825 |
+
status.textContent = `β
Ready! "${data.preview.substring(0, 60)}..."`;
|
| 826 |
+
|
| 827 |
+
// Auto-select the matching task
|
| 828 |
+
selectTask(data.task_id);
|
| 829 |
+
|
| 830 |
+
// Clear input
|
| 831 |
+
document.getElementById('topicInput').value = '';
|
| 832 |
+
}
|
| 833 |
+
} catch(e) {
|
| 834 |
+
status.style.color = '#f85149';
|
| 835 |
+
status.textContent = `β ${e.message}`;
|
| 836 |
+
} finally {
|
| 837 |
+
btn.disabled = false;
|
| 838 |
+
btn.textContent = 'β¨ Generate';
|
| 839 |
+
}
|
| 840 |
+
}
|
| 841 |
+
|
| 842 |
+
function sleep(ms) { return new Promise(r => setTimeout(r, ms)); }
|
| 843 |
+
|
| 844 |
+
document.getElementById('inputBox').addEventListener('input', function() {
|
| 845 |
+
this.style.height = '44px';
|
| 846 |
+
this.style.height = Math.min(this.scrollHeight, 120) + 'px';
|
| 847 |
+
});
|
| 848 |
+
</script>
|
| 849 |
+
</body>
|
| 850 |
+
</html>
|
static/leaderboard.html
ADDED
|
@@ -0,0 +1,377 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="UTF-8"/>
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
|
| 6 |
+
<title>SocraticEnv β Model Leaderboard</title>
|
| 7 |
+
<style>
|
| 8 |
+
* { margin:0; padding:0; box-sizing:border-box; }
|
| 9 |
+
body { font-family:'Segoe UI',system-ui,sans-serif; background:#0d1117; color:#e6edf3; min-height:100vh; }
|
| 10 |
+
.header {
|
| 11 |
+
background:#161b22; border-bottom:1px solid #30363d;
|
| 12 |
+
padding:16px 32px; display:flex; align-items:center;
|
| 13 |
+
justify-content:space-between;
|
| 14 |
+
}
|
| 15 |
+
.header-left { display:flex; align-items:center; gap:12px; }
|
| 16 |
+
.logo {
|
| 17 |
+
width:36px; height:36px;
|
| 18 |
+
background:linear-gradient(135deg,#7c3aed,#a855f7);
|
| 19 |
+
border-radius:8px; display:flex; align-items:center;
|
| 20 |
+
justify-content:center; font-size:18px;
|
| 21 |
+
}
|
| 22 |
+
.header h1 { font-size:18px; font-weight:600; }
|
| 23 |
+
.header p { font-size:12px; color:#8b949e; margin-top:2px; }
|
| 24 |
+
.nav-links { display:flex; gap:8px; }
|
| 25 |
+
.nav-link {
|
| 26 |
+
padding:6px 14px; border-radius:8px; font-size:12px;
|
| 27 |
+
font-weight:600; text-decoration:none; border:1px solid #30363d;
|
| 28 |
+
color:#8b949e; background:#21262d; transition:all 0.2s;
|
| 29 |
+
}
|
| 30 |
+
.nav-link:hover { color:#e6edf3; border-color:#7c3aed; }
|
| 31 |
+
.nav-link.active { color:#a855f7; border-color:#7c3aed; background:#13111e; }
|
| 32 |
+
.container { max-width:1000px; margin:0 auto; padding:32px 24px; }
|
| 33 |
+
.page-title { font-size:24px; font-weight:700; margin-bottom:6px; }
|
| 34 |
+
.page-sub { font-size:13px; color:#8b949e; margin-bottom:28px; }
|
| 35 |
+
|
| 36 |
+
/* Run panel */
|
| 37 |
+
.run-panel {
|
| 38 |
+
background:#161b22; border:1px solid #30363d;
|
| 39 |
+
border-radius:12px; padding:20px; margin-bottom:28px;
|
| 40 |
+
}
|
| 41 |
+
.run-title { font-size:14px; font-weight:600; margin-bottom:14px; color:#e6edf3; }
|
| 42 |
+
.run-row { display:flex; gap:10px; align-items:center; }
|
| 43 |
+
.run-input {
|
| 44 |
+
flex:1; background:#0d1117; border:1px solid #30363d;
|
| 45 |
+
border-radius:8px; padding:9px 14px; color:#e6edf3;
|
| 46 |
+
font-size:13px; font-family:inherit;
|
| 47 |
+
}
|
| 48 |
+
.run-input:focus { outline:none; border-color:#7c3aed; }
|
| 49 |
+
.run-input::placeholder { color:#484f58; }
|
| 50 |
+
.btn {
|
| 51 |
+
padding:9px 18px; border-radius:8px; font-size:13px;
|
| 52 |
+
font-weight:600; border:none; cursor:pointer;
|
| 53 |
+
transition:all 0.2s; white-space:nowrap;
|
| 54 |
+
}
|
| 55 |
+
.btn-primary { background:#7c3aed; color:white; }
|
| 56 |
+
.btn-primary:hover { background:#6d28d9; }
|
| 57 |
+
.btn-primary:disabled { background:#3d2070; color:#8b6bb5; cursor:not-allowed; }
|
| 58 |
+
.run-status {
|
| 59 |
+
margin-top:12px; font-size:12px; color:#8b949e;
|
| 60 |
+
min-height:20px; display:flex; align-items:center; gap:8px;
|
| 61 |
+
}
|
| 62 |
+
.spinner {
|
| 63 |
+
width:14px; height:14px; border:2px solid #30363d;
|
| 64 |
+
border-top-color:#7c3aed; border-radius:50%;
|
| 65 |
+
animation:spin 0.8s linear infinite; display:none;
|
| 66 |
+
}
|
| 67 |
+
@keyframes spin { to { transform:rotate(360deg); } }
|
| 68 |
+
|
| 69 |
+
/* Stats row */
|
| 70 |
+
.stats-row { display:grid; grid-template-columns:repeat(3,1fr); gap:12px; margin-bottom:24px; }
|
| 71 |
+
.stat-card {
|
| 72 |
+
background:#161b22; border:1px solid #30363d;
|
| 73 |
+
border-radius:10px; padding:16px; text-align:center;
|
| 74 |
+
}
|
| 75 |
+
.stat-val { font-size:28px; font-weight:700; color:#7c3aed; }
|
| 76 |
+
.stat-lbl { font-size:11px; color:#8b949e; margin-top:4px; }
|
| 77 |
+
|
| 78 |
+
/* Table */
|
| 79 |
+
.table-wrap {
|
| 80 |
+
background:#161b22; border:1px solid #30363d;
|
| 81 |
+
border-radius:12px; overflow:hidden;
|
| 82 |
+
}
|
| 83 |
+
.table-header {
|
| 84 |
+
display:grid;
|
| 85 |
+
grid-template-columns:40px 1fr 100px 100px 100px 110px 140px;
|
| 86 |
+
padding:10px 16px; background:#0d1117;
|
| 87 |
+
border-bottom:1px solid #30363d;
|
| 88 |
+
font-size:10px; font-weight:600; color:#8b949e;
|
| 89 |
+
letter-spacing:0.8px; text-transform:uppercase;
|
| 90 |
+
}
|
| 91 |
+
.table-row {
|
| 92 |
+
display:grid;
|
| 93 |
+
grid-template-columns:40px 1fr 100px 100px 100px 110px 140px;
|
| 94 |
+
padding:14px 16px; border-bottom:1px solid #21262d;
|
| 95 |
+
align-items:center; transition:background 0.15s;
|
| 96 |
+
}
|
| 97 |
+
.table-row:last-child { border-bottom:none; }
|
| 98 |
+
.table-row:hover { background:#1c2128; }
|
| 99 |
+
.table-row.top { background:#13111e; }
|
| 100 |
+
.rank { font-size:14px; font-weight:700; color:#8b949e; }
|
| 101 |
+
.rank.gold { color:#f59e0b; }
|
| 102 |
+
.rank.silver { color:#94a3b8; }
|
| 103 |
+
.rank.bronze { color:#cd7f32; }
|
| 104 |
+
.model-name { font-size:13px; font-weight:600; color:#e6edf3; }
|
| 105 |
+
.model-time { font-size:10px; color:#484f58; margin-top:2px; }
|
| 106 |
+
.score-cell { text-align:center; }
|
| 107 |
+
.score-val {
|
| 108 |
+
font-size:13px; font-weight:600;
|
| 109 |
+
padding:3px 10px; border-radius:6px; display:inline-block;
|
| 110 |
+
}
|
| 111 |
+
.score-high { background:#1a3a2a; color:#3fb950; }
|
| 112 |
+
.score-mid { background:#332d1a; color:#d29922; }
|
| 113 |
+
.score-low { background:#3a1a1a; color:#f85149; }
|
| 114 |
+
.overall-val {
|
| 115 |
+
font-size:15px; font-weight:700; text-align:center;
|
| 116 |
+
}
|
| 117 |
+
.bar-wrap { display:flex; align-items:center; gap:6px; }
|
| 118 |
+
.bar-bg { flex:1; height:6px; background:#21262d; border-radius:3px; overflow:hidden; }
|
| 119 |
+
.bar-fill { height:100%; border-radius:3px; transition:width 0.6s ease; }
|
| 120 |
+
.delete-btn {
|
| 121 |
+
background:none; border:none; color:#484f58;
|
| 122 |
+
cursor:pointer; font-size:12px; padding:4px 8px;
|
| 123 |
+
border-radius:4px; transition:all 0.2s;
|
| 124 |
+
}
|
| 125 |
+
.delete-btn:hover { color:#f85149; background:#3a1a1a; }
|
| 126 |
+
|
| 127 |
+
/* Empty state */
|
| 128 |
+
.empty {
|
| 129 |
+
text-align:center; padding:48px 24px;
|
| 130 |
+
color:#8b949e;
|
| 131 |
+
}
|
| 132 |
+
.empty-icon { font-size:40px; opacity:0.3; margin-bottom:12px; }
|
| 133 |
+
.empty-title { font-size:15px; font-weight:600; margin-bottom:6px; }
|
| 134 |
+
.empty-sub { font-size:12px; }
|
| 135 |
+
|
| 136 |
+
/* Seed panel */
|
| 137 |
+
.seed-panel {
|
| 138 |
+
background:#161b22; border:1px solid #30363d;
|
| 139 |
+
border-radius:12px; padding:16px 20px;
|
| 140 |
+
margin-bottom:20px; display:flex;
|
| 141 |
+
align-items:center; justify-content:space-between;
|
| 142 |
+
gap:16px;
|
| 143 |
+
}
|
| 144 |
+
.seed-text { font-size:12px; color:#8b949e; }
|
| 145 |
+
.seed-text strong { color:#e6edf3; }
|
| 146 |
+
.btn-secondary {
|
| 147 |
+
background:#21262d; color:#e6edf3;
|
| 148 |
+
border:1px solid #30363d;
|
| 149 |
+
}
|
| 150 |
+
.btn-secondary:hover { background:#30363d; }
|
| 151 |
+
</style>
|
| 152 |
+
</head>
|
| 153 |
+
<body>
|
| 154 |
+
|
| 155 |
+
<div class="header">
|
| 156 |
+
<div class="header-left">
|
| 157 |
+
<div class="logo">π</div>
|
| 158 |
+
<div>
|
| 159 |
+
<h1>SocraticEnv</h1>
|
| 160 |
+
<p>OpenEnv Hackathon Β· Meta Γ PyTorch Γ Scaler</p>
|
| 161 |
+
</div>
|
| 162 |
+
</div>
|
| 163 |
+
<div class="nav-links">
|
| 164 |
+
<a href="/ui" class="nav-link">Live Demo</a>
|
| 165 |
+
<a href="/leaderboard" class="nav-link active">Leaderboard</a>
|
| 166 |
+
<a href="/docs" class="nav-link">API Docs</a>
|
| 167 |
+
</div>
|
| 168 |
+
</div>
|
| 169 |
+
|
| 170 |
+
<div class="container">
|
| 171 |
+
<div class="page-title">Model Leaderboard</div>
|
| 172 |
+
<div class="page-sub">Compare AI models on Socratic reasoning ability across all 3 tasks. Which model thinks best under pressure?</div>
|
| 173 |
+
|
| 174 |
+
<!-- Seed with default data -->
|
| 175 |
+
<div class="seed-panel" id="seedPanel" style="display:none">
|
| 176 |
+
<div class="seed-text">No entries yet. <strong>Seed with baseline scores</strong> to populate the leaderboard with known model performance.</div>
|
| 177 |
+
<button class="btn btn-secondary" onclick="seedBaseline()">Seed Baseline Data</button>
|
| 178 |
+
</div>
|
| 179 |
+
|
| 180 |
+
<!-- Run evaluation panel -->
|
| 181 |
+
<div class="run-panel">
|
| 182 |
+
<div class="run-title">Run a new model evaluation</div>
|
| 183 |
+
<div class="run-row">
|
| 184 |
+
<input class="run-input" id="modelName" placeholder="Enter a display name e.g. Llama 3.1 8B, GPT-4o, Mistral 7B..." />
|
| 185 |
+
<button class="btn btn-primary" id="runBtn" onclick="runEval()">Run Evaluation</button>
|
| 186 |
+
</div>
|
| 187 |
+
<div class="run-status" id="runStatus">
|
| 188 |
+
<div class="spinner" id="spinner"></div>
|
| 189 |
+
<span id="statusText">Enter a model name and click Run to benchmark the current model against all 3 tasks.</span>
|
| 190 |
+
</div>
|
| 191 |
+
</div>
|
| 192 |
+
|
| 193 |
+
<!-- Stats -->
|
| 194 |
+
<div class="stats-row">
|
| 195 |
+
<div class="stat-card">
|
| 196 |
+
<div class="stat-val" id="statModels">0</div>
|
| 197 |
+
<div class="stat-lbl">Models evaluated</div>
|
| 198 |
+
</div>
|
| 199 |
+
<div class="stat-card">
|
| 200 |
+
<div class="stat-val" id="statBest">β</div>
|
| 201 |
+
<div class="stat-lbl">Best overall score</div>
|
| 202 |
+
</div>
|
| 203 |
+
<div class="stat-card">
|
| 204 |
+
<div class="stat-val" id="statHardest">β</div>
|
| 205 |
+
<div class="stat-lbl">Hardest task avg</div>
|
| 206 |
+
</div>
|
| 207 |
+
</div>
|
| 208 |
+
|
| 209 |
+
<!-- Table -->
|
| 210 |
+
<div class="table-wrap">
|
| 211 |
+
<div class="table-header">
|
| 212 |
+
<div>Rank</div>
|
| 213 |
+
<div>Model</div>
|
| 214 |
+
<div>Easy</div>
|
| 215 |
+
<div>Medium</div>
|
| 216 |
+
<div>Hard</div>
|
| 217 |
+
<div>Overall</div>
|
| 218 |
+
<div>Progress</div>
|
| 219 |
+
</div>
|
| 220 |
+
<div id="tableBody">
|
| 221 |
+
<div class="empty">
|
| 222 |
+
<div class="empty-icon">π</div>
|
| 223 |
+
<div class="empty-title">No models evaluated yet</div>
|
| 224 |
+
<div class="empty-sub">Run an evaluation above to add the first entry</div>
|
| 225 |
+
</div>
|
| 226 |
+
</div>
|
| 227 |
+
</div>
|
| 228 |
+
</div>
|
| 229 |
+
|
| 230 |
+
<script>
|
| 231 |
+
const API = window.location.origin;
|
| 232 |
+
|
| 233 |
+
async function loadLeaderboard() {
|
| 234 |
+
try {
|
| 235 |
+
const r = await fetch(`${API}/leaderboard`);
|
| 236 |
+
const data = await r.json();
|
| 237 |
+
renderTable(data.entries);
|
| 238 |
+
updateStats(data.entries);
|
| 239 |
+
if (data.entries.length === 0) {
|
| 240 |
+
document.getElementById('seedPanel').style.display = 'flex';
|
| 241 |
+
} else {
|
| 242 |
+
document.getElementById('seedPanel').style.display = 'none';
|
| 243 |
+
}
|
| 244 |
+
} catch(e) {
|
| 245 |
+
console.error(e);
|
| 246 |
+
}
|
| 247 |
+
}
|
| 248 |
+
|
| 249 |
+
function scoreClass(s) {
|
| 250 |
+
return s >= 0.7 ? 'score-high' : s >= 0.5 ? 'score-mid' : 'score-low';
|
| 251 |
+
}
|
| 252 |
+
|
| 253 |
+
function overallColor(s) {
|
| 254 |
+
return s >= 0.7 ? '#3fb950' : s >= 0.5 ? '#d29922' : '#f85149';
|
| 255 |
+
}
|
| 256 |
+
|
| 257 |
+
function rankLabel(i) {
|
| 258 |
+
if (i === 0) return '<span class="rank gold">π₯</span>';
|
| 259 |
+
if (i === 1) return '<span class="rank silver">π₯</span>';
|
| 260 |
+
if (i === 2) return '<span class="rank bronze">π₯</span>';
|
| 261 |
+
return `<span class="rank">${i+1}</span>`;
|
| 262 |
+
}
|
| 263 |
+
|
| 264 |
+
function renderTable(entries) {
|
| 265 |
+
const body = document.getElementById('tableBody');
|
| 266 |
+
if (!entries || entries.length === 0) {
|
| 267 |
+
body.innerHTML = `
|
| 268 |
+
<div class="empty">
|
| 269 |
+
<div class="empty-icon">π</div>
|
| 270 |
+
<div class="empty-title">No models evaluated yet</div>
|
| 271 |
+
<div class="empty-sub">Run an evaluation above to add the first entry</div>
|
| 272 |
+
</div>`;
|
| 273 |
+
return;
|
| 274 |
+
}
|
| 275 |
+
|
| 276 |
+
body.innerHTML = entries.map((e, i) => `
|
| 277 |
+
<div class="table-row ${i===0?'top':''}">
|
| 278 |
+
<div>${rankLabel(i)}</div>
|
| 279 |
+
<div>
|
| 280 |
+
<div class="model-name">${e.model_name}</div>
|
| 281 |
+
<div class="model-time">${e.timestamp || ''}</div>
|
| 282 |
+
</div>
|
| 283 |
+
<div class="score-cell">
|
| 284 |
+
<span class="score-val ${scoreClass(e.factual_recall)}">${e.factual_recall.toFixed(3)}</span>
|
| 285 |
+
</div>
|
| 286 |
+
<div class="score-cell">
|
| 287 |
+
<span class="score-val ${scoreClass(e.socratic_dialogue)}">${e.socratic_dialogue.toFixed(3)}</span>
|
| 288 |
+
</div>
|
| 289 |
+
<div class="score-cell">
|
| 290 |
+
<span class="score-val ${scoreClass(e.misconception_trap)}">${e.misconception_trap.toFixed(3)}</span>
|
| 291 |
+
</div>
|
| 292 |
+
<div class="overall-val" style="color:${overallColor(e.overall)}">${e.overall.toFixed(3)}</div>
|
| 293 |
+
<div>
|
| 294 |
+
<div class="bar-wrap">
|
| 295 |
+
<div class="bar-bg">
|
| 296 |
+
<div class="bar-fill" style="width:${e.overall*100}%;background:${overallColor(e.overall)}"></div>
|
| 297 |
+
</div>
|
| 298 |
+
<button class="delete-btn" onclick="deleteEntry('${e.model_name}')">β</button>
|
| 299 |
+
</div>
|
| 300 |
+
</div>
|
| 301 |
+
</div>`).join('');
|
| 302 |
+
}
|
| 303 |
+
|
| 304 |
+
function updateStats(entries) {
|
| 305 |
+
document.getElementById('statModels').textContent = entries.length;
|
| 306 |
+
if (entries.length > 0) {
|
| 307 |
+
document.getElementById('statBest').textContent = entries[0].overall.toFixed(3);
|
| 308 |
+
const hardAvg = entries.reduce((s,e) => s + e.misconception_trap, 0) / entries.length;
|
| 309 |
+
document.getElementById('statHardest').textContent = hardAvg.toFixed(3);
|
| 310 |
+
}
|
| 311 |
+
}
|
| 312 |
+
|
| 313 |
+
async function runEval() {
|
| 314 |
+
const name = document.getElementById('modelName').value.trim();
|
| 315 |
+
if (!name) {
|
| 316 |
+
document.getElementById('statusText').textContent = 'β οΈ Please enter a model name first.';
|
| 317 |
+
return;
|
| 318 |
+
}
|
| 319 |
+
|
| 320 |
+
const btn = document.getElementById('runBtn');
|
| 321 |
+
const spinner = document.getElementById('spinner');
|
| 322 |
+
const statusText = document.getElementById('statusText');
|
| 323 |
+
|
| 324 |
+
btn.disabled = true;
|
| 325 |
+
spinner.style.display = 'block';
|
| 326 |
+
statusText.textContent = `Running ${name} against all 3 tasks... this takes ~30 seconds.`;
|
| 327 |
+
|
| 328 |
+
try {
|
| 329 |
+
const r = await fetch(`${API}/leaderboard/run`, {
|
| 330 |
+
method: 'POST',
|
| 331 |
+
headers: { 'Content-Type': 'application/json' },
|
| 332 |
+
body: JSON.stringify({ model_name: name }),
|
| 333 |
+
});
|
| 334 |
+
const data = await r.json();
|
| 335 |
+
|
| 336 |
+
if (data.error) {
|
| 337 |
+
statusText.textContent = `β Error: ${data.error}`;
|
| 338 |
+
} else {
|
| 339 |
+
statusText.textContent = `β
Done! ${name} scored ${data.overall.toFixed(3)} overall.`;
|
| 340 |
+
document.getElementById('modelName').value = '';
|
| 341 |
+
loadLeaderboard();
|
| 342 |
+
}
|
| 343 |
+
} catch(e) {
|
| 344 |
+
statusText.textContent = `β Failed: ${e.message}`;
|
| 345 |
+
} finally {
|
| 346 |
+
btn.disabled = false;
|
| 347 |
+
spinner.style.display = 'none';
|
| 348 |
+
}
|
| 349 |
+
}
|
| 350 |
+
|
| 351 |
+
async function deleteEntry(modelName) {
|
| 352 |
+
if (!confirm(`Remove ${modelName} from leaderboard?`)) return;
|
| 353 |
+
await fetch(`${API}/leaderboard/${encodeURIComponent(modelName)}`, { method: 'DELETE' });
|
| 354 |
+
loadLeaderboard();
|
| 355 |
+
}
|
| 356 |
+
|
| 357 |
+
async function seedBaseline() {
|
| 358 |
+
const baseline = [
|
| 359 |
+
{ model_name: "Llama 3.1 8B (baseline)", factual_recall: 0.71, socratic_dialogue: 0.68, misconception_trap: 0.58, overall: 0.657, timestamp: "Baseline β 2026-04-06" },
|
| 360 |
+
{ model_name: "Random agent", factual_recall: 0.18, socratic_dialogue: 0.22, misconception_trap: 0.10, overall: 0.167, timestamp: "Baseline β 2026-04-06" },
|
| 361 |
+
];
|
| 362 |
+
|
| 363 |
+
for (const entry of baseline) {
|
| 364 |
+
await fetch(`${API}/leaderboard`, {
|
| 365 |
+
method: 'POST',
|
| 366 |
+
headers: { 'Content-Type': 'application/json' },
|
| 367 |
+
body: JSON.stringify(entry),
|
| 368 |
+
});
|
| 369 |
+
}
|
| 370 |
+
loadLeaderboard();
|
| 371 |
+
}
|
| 372 |
+
|
| 373 |
+
// Load on page start
|
| 374 |
+
loadLeaderboard();
|
| 375 |
+
</script>
|
| 376 |
+
</body>
|
| 377 |
+
</html>
|
tests/__init__.py
ADDED
|
File without changes
|
tests/__pycache__/__init__.cpython-313.pyc
ADDED
|
Binary file (130 Bytes). View file
|
|
|
tests/__pycache__/test_api.cpython-313-pytest-9.0.2.pyc
ADDED
|
Binary file (45.1 kB). View file
|
|
|
tests/__pycache__/test_environment.cpython-313-pytest-9.0.2.pyc
ADDED
|
Binary file (50.2 kB). View file
|
|
|
tests/test_api.py
ADDED
|
@@ -0,0 +1,264 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Tests for SocraticEnv FastAPI endpoints.
|
| 3 |
+
"""
|
| 4 |
+
import pytest
|
| 5 |
+
from fastapi.testclient import TestClient
|
| 6 |
+
import sys
|
| 7 |
+
import os
|
| 8 |
+
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
| 9 |
+
|
| 10 |
+
from main import app
|
| 11 |
+
|
| 12 |
+
client = TestClient(app)
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
# ββ Root & Health Tests βββββββββββββββββββββββββββββββββββ
|
| 16 |
+
|
| 17 |
+
def test_root_returns_200():
|
| 18 |
+
r = client.get("/")
|
| 19 |
+
assert r.status_code == 200
|
| 20 |
+
data = r.json()
|
| 21 |
+
assert data["name"] == "SocraticEnv"
|
| 22 |
+
assert data["status"] == "running"
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
def test_ping_returns_healthy():
|
| 26 |
+
r = client.get("/ping")
|
| 27 |
+
assert r.status_code == 200
|
| 28 |
+
assert r.json()["status"] == "ok"
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def test_health_endpoint():
|
| 32 |
+
r = client.get("/health")
|
| 33 |
+
assert r.status_code == 200
|
| 34 |
+
assert r.json()["status"] == "healthy"
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def test_metadata_endpoint():
|
| 38 |
+
r = client.get("/metadata")
|
| 39 |
+
assert r.status_code == 200
|
| 40 |
+
data = r.json()
|
| 41 |
+
assert "name" in data
|
| 42 |
+
assert "description" in data
|
| 43 |
+
assert data["name"] == "SocraticEnv"
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def test_schema_endpoint():
|
| 47 |
+
r = client.get("/schema")
|
| 48 |
+
assert r.status_code == 200
|
| 49 |
+
data = r.json()
|
| 50 |
+
assert "action" in data
|
| 51 |
+
assert "observation" in data
|
| 52 |
+
assert "state" in data
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def test_mcp_endpoint():
|
| 56 |
+
r = client.post("/mcp", json={"method": "initialize", "id": 1})
|
| 57 |
+
assert r.status_code == 200
|
| 58 |
+
data = r.json()
|
| 59 |
+
assert data["jsonrpc"] == "2.0"
|
| 60 |
+
assert "result" in data
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
# ββ Tasks Tests βββββββββββββββββββββββββββββββββββββββββββ
|
| 64 |
+
|
| 65 |
+
def test_list_tasks_returns_all_five():
|
| 66 |
+
r = client.get("/tasks")
|
| 67 |
+
assert r.status_code == 200
|
| 68 |
+
tasks = r.json()["tasks"]
|
| 69 |
+
assert len(tasks) == 5
|
| 70 |
+
task_ids = [t["id"] for t in tasks]
|
| 71 |
+
assert "factual_recall" in task_ids
|
| 72 |
+
assert "socratic_dialogue" in task_ids
|
| 73 |
+
assert "misconception_trap" in task_ids
|
| 74 |
+
assert "debate_mode" in task_ids
|
| 75 |
+
assert "analogy_challenge" in task_ids
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def test_tasks_have_required_fields():
|
| 79 |
+
r = client.get("/tasks")
|
| 80 |
+
tasks = r.json()["tasks"]
|
| 81 |
+
for task in tasks:
|
| 82 |
+
assert "id" in task
|
| 83 |
+
assert "name" in task
|
| 84 |
+
assert "difficulty" in task
|
| 85 |
+
assert "description" in task
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def test_tasks_difficulty_values():
|
| 89 |
+
r = client.get("/tasks")
|
| 90 |
+
tasks = r.json()["tasks"]
|
| 91 |
+
valid_difficulties = ["easy", "medium", "hard"]
|
| 92 |
+
for task in tasks:
|
| 93 |
+
assert task["difficulty"] in valid_difficulties
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
# ββ Reset Tests βββββββββββββββββββββββββββββββββββββββββββ
|
| 97 |
+
|
| 98 |
+
def test_reset_factual_recall():
|
| 99 |
+
r = client.post("/reset", json={"task_id": "factual_recall"})
|
| 100 |
+
assert r.status_code == 200
|
| 101 |
+
data = r.json()
|
| 102 |
+
assert "observation" in data
|
| 103 |
+
assert data["observation"]["task_id"] == "factual_recall"
|
| 104 |
+
assert len(data["observation"]["question"]) > 0
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
def test_reset_socratic_dialogue():
|
| 108 |
+
r = client.post("/reset", json={"task_id": "socratic_dialogue"})
|
| 109 |
+
assert r.status_code == 200
|
| 110 |
+
assert r.json()["observation"]["task_id"] == "socratic_dialogue"
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
def test_reset_misconception_trap():
|
| 114 |
+
r = client.post("/reset", json={"task_id": "misconception_trap"})
|
| 115 |
+
assert r.status_code == 200
|
| 116 |
+
assert r.json()["observation"]["task_id"] == "misconception_trap"
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
def test_reset_debate_mode():
|
| 120 |
+
r = client.post("/reset", json={"task_id": "debate_mode"})
|
| 121 |
+
assert r.status_code == 200
|
| 122 |
+
assert r.json()["observation"]["task_id"] == "debate_mode"
|
| 123 |
+
|
| 124 |
+
|
| 125 |
+
def test_reset_analogy_challenge():
|
| 126 |
+
r = client.post("/reset", json={"task_id": "analogy_challenge"})
|
| 127 |
+
assert r.status_code == 200
|
| 128 |
+
assert r.json()["observation"]["task_id"] == "analogy_challenge"
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
def test_reset_invalid_task_returns_400():
|
| 132 |
+
r = client.post("/reset", json={"task_id": "nonexistent_task"})
|
| 133 |
+
assert r.status_code == 400
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
def test_reset_default_task():
|
| 137 |
+
r = client.post("/reset", json={})
|
| 138 |
+
assert r.status_code == 200
|
| 139 |
+
|
| 140 |
+
|
| 141 |
+
# ββ Step Tests ββββββββββββββββββββββββββββββββββββββββββββ
|
| 142 |
+
|
| 143 |
+
def test_step_returns_reward_and_observation():
|
| 144 |
+
client.post("/reset", json={"task_id": "factual_recall"})
|
| 145 |
+
r = client.post("/step", json={"response": "Force equals mass times acceleration F=ma."})
|
| 146 |
+
assert r.status_code == 200
|
| 147 |
+
data = r.json()
|
| 148 |
+
assert "reward" in data
|
| 149 |
+
assert "observation" in data
|
| 150 |
+
assert "done" in data
|
| 151 |
+
assert "info" in data
|
| 152 |
+
|
| 153 |
+
|
| 154 |
+
def test_step_reward_in_valid_range():
|
| 155 |
+
client.post("/reset", json={"task_id": "factual_recall"})
|
| 156 |
+
r = client.post("/step", json={"response": "Force equals mass times acceleration."})
|
| 157 |
+
score = r.json()["reward"]["score"]
|
| 158 |
+
assert 0.0 <= score <= 1.0
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def test_step_empty_response_returns_400():
|
| 162 |
+
client.post("/reset", json={"task_id": "factual_recall"})
|
| 163 |
+
r = client.post("/step", json={"response": ""})
|
| 164 |
+
assert r.status_code == 400
|
| 165 |
+
|
| 166 |
+
|
| 167 |
+
def test_step_without_reset_returns_400():
|
| 168 |
+
# Force done state by completing an episode
|
| 169 |
+
client.post("/reset", json={"task_id": "factual_recall"})
|
| 170 |
+
client.post("/step", json={"response": "Force and mass and acceleration F=ma."})
|
| 171 |
+
client.post("/step", json={"response": "Doubling force doubles acceleration."})
|
| 172 |
+
client.post("/step", json={"response": "No heavier objects do not accelerate faster."})
|
| 173 |
+
# Now try to step again without reset
|
| 174 |
+
r = client.post("/step", json={"response": "another response"})
|
| 175 |
+
assert r.status_code == 400
|
| 176 |
+
|
| 177 |
+
|
| 178 |
+
def test_full_episode_all_tasks():
|
| 179 |
+
"""Each task completes a full episode without errors."""
|
| 180 |
+
task_responses = {
|
| 181 |
+
"factual_recall": [
|
| 182 |
+
"Newton's Second Law states force equals mass times acceleration F=ma.",
|
| 183 |
+
"Doubling force doubles acceleration since they are proportional.",
|
| 184 |
+
"No that is incorrect heavier objects do not accelerate faster.",
|
| 185 |
+
],
|
| 186 |
+
"debate_mode": [
|
| 187 |
+
"Social media causes harm because research shows negative mental health effects.",
|
| 188 |
+
"However social media provides benefits because it connects communities globally.",
|
| 189 |
+
"I argue nuanced positions are more intellectually honest than absolute stances.",
|
| 190 |
+
"Therefore I propose time limits and age verification as policy solutions.",
|
| 191 |
+
],
|
| 192 |
+
"analogy_challenge": [
|
| 193 |
+
"The internet is like a postal system where your computer sends letters to other computers.",
|
| 194 |
+
"Clicking a link is like giving someone a new address to send their letter to.",
|
| 195 |
+
"Slow websites are like traffic jams in the postal system with too many letters at once.",
|
| 196 |
+
],
|
| 197 |
+
}
|
| 198 |
+
|
| 199 |
+
for task_id, responses in task_responses.items():
|
| 200 |
+
client.post("/reset", json={"task_id": task_id})
|
| 201 |
+
for resp in responses:
|
| 202 |
+
r = client.post("/step", json={"response": resp})
|
| 203 |
+
assert r.status_code == 200
|
| 204 |
+
data = r.json()
|
| 205 |
+
assert 0.0 <= data["reward"]["score"] <= 1.0
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
# ββ State Tests βββββββββββββββββββββββββββββββββββββββββββ
|
| 209 |
+
|
| 210 |
+
def test_state_endpoint():
|
| 211 |
+
client.post("/reset", json={"task_id": "factual_recall"})
|
| 212 |
+
r = client.get("/state")
|
| 213 |
+
assert r.status_code == 200
|
| 214 |
+
data = r.json()
|
| 215 |
+
assert "task_id" in data
|
| 216 |
+
assert "turn" in data
|
| 217 |
+
assert "done" in data
|
| 218 |
+
assert "history" in data
|
| 219 |
+
assert "total_score" in data
|
| 220 |
+
|
| 221 |
+
|
| 222 |
+
def test_state_updates_after_step():
|
| 223 |
+
client.post("/reset", json={"task_id": "factual_recall"})
|
| 224 |
+
client.post("/step", json={"response": "Force equals mass times acceleration."})
|
| 225 |
+
r = client.get("/state")
|
| 226 |
+
assert r.json()["turn"] == 1
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
# ββ Leaderboard Tests βββββββββββββββββββββββββββββββββββββ
|
| 230 |
+
|
| 231 |
+
def test_leaderboard_get():
|
| 232 |
+
r = client.get("/leaderboard")
|
| 233 |
+
assert r.status_code == 200
|
| 234 |
+
data = r.json()
|
| 235 |
+
assert "entries" in data
|
| 236 |
+
assert "total" in data
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
def test_leaderboard_post_entry():
|
| 240 |
+
entry = {
|
| 241 |
+
"model_name": "Test Model pytest",
|
| 242 |
+
"factual_recall": 0.75,
|
| 243 |
+
"socratic_dialogue": 0.68,
|
| 244 |
+
"misconception_trap": 0.60,
|
| 245 |
+
"overall": 0.677,
|
| 246 |
+
}
|
| 247 |
+
r = client.post("/leaderboard", json=entry)
|
| 248 |
+
assert r.status_code == 200
|
| 249 |
+
assert r.json()["success"] == True
|
| 250 |
+
|
| 251 |
+
|
| 252 |
+
def test_leaderboard_delete_entry():
|
| 253 |
+
# Add then delete
|
| 254 |
+
entry = {
|
| 255 |
+
"model_name": "DeleteMe pytest",
|
| 256 |
+
"factual_recall": 0.5,
|
| 257 |
+
"socratic_dialogue": 0.5,
|
| 258 |
+
"misconception_trap": 0.5,
|
| 259 |
+
"overall": 0.5,
|
| 260 |
+
}
|
| 261 |
+
client.post("/leaderboard", json=entry)
|
| 262 |
+
r = client.delete("/leaderboard/DeleteMe pytest")
|
| 263 |
+
assert r.status_code == 200
|
| 264 |
+
assert r.json()["success"] == True
|
tests/test_environment.py
ADDED
|
@@ -0,0 +1,253 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Tests for SocraticEnv core environment logic.
|
| 3 |
+
"""
|
| 4 |
+
import pytest
|
| 5 |
+
import sys
|
| 6 |
+
import os
|
| 7 |
+
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
| 8 |
+
|
| 9 |
+
from environment import (
|
| 10 |
+
SocraticEnvironment,
|
| 11 |
+
Action,
|
| 12 |
+
Observation,
|
| 13 |
+
Reward,
|
| 14 |
+
StepResult,
|
| 15 |
+
StateInfo,
|
| 16 |
+
)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
# ββ Fixtures ββββββββββββββββββββββββββββββββββββββββββββββ
|
| 20 |
+
|
| 21 |
+
@pytest.fixture
|
| 22 |
+
def env():
|
| 23 |
+
"""Fresh environment for each test."""
|
| 24 |
+
return SocraticEnvironment()
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
@pytest.fixture(autouse=True)
|
| 28 |
+
def mock_random_choice(monkeypatch):
|
| 29 |
+
"""Ensure random.choice always picks the first topic for deterministic testing."""
|
| 30 |
+
monkeypatch.setattr("environment.random.choice", lambda seq: seq[0])
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
# ββ Reset Tests βββββββββββββββββββββββββββββββββββββββββββ
|
| 34 |
+
|
| 35 |
+
def test_reset_factual_recall(env):
|
| 36 |
+
obs = env.reset("factual_recall")
|
| 37 |
+
assert isinstance(obs, Observation)
|
| 38 |
+
assert obs.task_id == "factual_recall"
|
| 39 |
+
assert obs.turn == 0
|
| 40 |
+
assert len(obs.question) > 0
|
| 41 |
+
assert env.done == False
|
| 42 |
+
assert env.max_turns == 3
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def test_reset_socratic_dialogue(env):
|
| 46 |
+
obs = env.reset("socratic_dialogue")
|
| 47 |
+
assert isinstance(obs, Observation)
|
| 48 |
+
assert obs.task_id == "socratic_dialogue"
|
| 49 |
+
assert env.max_turns == 5
|
| 50 |
+
assert env.done == False
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
def test_reset_misconception_trap(env):
|
| 54 |
+
obs = env.reset("misconception_trap")
|
| 55 |
+
assert isinstance(obs, Observation)
|
| 56 |
+
assert obs.task_id == "misconception_trap"
|
| 57 |
+
assert env.max_turns == 3
|
| 58 |
+
assert env.done == False
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def test_reset_debate_mode(env):
|
| 62 |
+
obs = env.reset("debate_mode")
|
| 63 |
+
assert isinstance(obs, Observation)
|
| 64 |
+
assert obs.task_id == "debate_mode"
|
| 65 |
+
assert env.max_turns == 4
|
| 66 |
+
assert env.done == False
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def test_reset_analogy_challenge(env):
|
| 70 |
+
obs = env.reset("analogy_challenge")
|
| 71 |
+
assert isinstance(obs, Observation)
|
| 72 |
+
assert obs.task_id == "analogy_challenge"
|
| 73 |
+
assert env.max_turns == 3
|
| 74 |
+
assert env.done == False
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def test_reset_invalid_task(env):
|
| 78 |
+
with pytest.raises(ValueError):
|
| 79 |
+
env.reset("invalid_task_that_does_not_exist")
|
| 80 |
+
|
| 81 |
+
|
| 82 |
+
def test_reset_clears_history(env):
|
| 83 |
+
env.reset("factual_recall")
|
| 84 |
+
action = Action(response="Some response about Newton's law with force and mass.")
|
| 85 |
+
env.step(action)
|
| 86 |
+
assert len(env.history) > 0
|
| 87 |
+
|
| 88 |
+
# Reset should clear everything
|
| 89 |
+
env.reset("factual_recall")
|
| 90 |
+
assert len(env.history) == 1 # just the opening question
|
| 91 |
+
assert env.turn == 0
|
| 92 |
+
assert env.total_score == 0.0
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
# ββ Step Tests ββββββββββββββββββββββββββββββββββββββββββββ
|
| 96 |
+
|
| 97 |
+
def test_step_returns_step_result(env):
|
| 98 |
+
env.reset("factual_recall")
|
| 99 |
+
action = Action(response="Force equals mass times acceleration according to Newton.")
|
| 100 |
+
result = env.step(action)
|
| 101 |
+
assert isinstance(result, StepResult)
|
| 102 |
+
assert isinstance(result.reward, Reward)
|
| 103 |
+
assert isinstance(result.observation, Observation)
|
| 104 |
+
assert isinstance(result.done, bool)
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
def test_step_reward_in_valid_range(env):
|
| 108 |
+
env.reset("factual_recall")
|
| 109 |
+
action = Action(response="Force equals mass times acceleration.")
|
| 110 |
+
result = env.step(action)
|
| 111 |
+
assert 0.0 <= result.reward.score <= 1.0
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
def test_step_reward_has_breakdown(env):
|
| 115 |
+
env.reset("factual_recall")
|
| 116 |
+
action = Action(response="Force equals mass times acceleration.")
|
| 117 |
+
result = env.step(action)
|
| 118 |
+
assert isinstance(result.reward.breakdown, dict)
|
| 119 |
+
assert len(result.reward.breakdown) > 0
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
def test_step_before_reset_raises(env):
|
| 123 |
+
with pytest.raises(ValueError):
|
| 124 |
+
env.step(Action(response="test"))
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def test_step_increments_turn(env):
|
| 128 |
+
env.reset("factual_recall")
|
| 129 |
+
assert env.turn == 0
|
| 130 |
+
env.step(Action(response="Force equals mass times acceleration with F=ma."))
|
| 131 |
+
assert env.turn == 1
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
def test_full_factual_recall_episode(env):
|
| 135 |
+
env.reset("factual_recall")
|
| 136 |
+
responses = [
|
| 137 |
+
"Newton's Second Law states force equals mass times acceleration F=ma.",
|
| 138 |
+
"Doubling force doubles acceleration since F=ma means acceleration is proportional.",
|
| 139 |
+
"No that is incorrect. Heavier objects do not accelerate faster with same force.",
|
| 140 |
+
]
|
| 141 |
+
total = 0.0
|
| 142 |
+
for resp in responses:
|
| 143 |
+
result = env.step(Action(response=resp))
|
| 144 |
+
assert 0.0 <= result.reward.score <= 1.0
|
| 145 |
+
total += result.reward.score
|
| 146 |
+
if result.done:
|
| 147 |
+
break
|
| 148 |
+
assert result.done == True
|
| 149 |
+
assert total > 0.0
|
| 150 |
+
|
| 151 |
+
|
| 152 |
+
def test_full_socratic_dialogue_episode(env):
|
| 153 |
+
env.reset("socratic_dialogue")
|
| 154 |
+
responses = [
|
| 155 |
+
"Consciousness is the subjective experience of awareness and self-reflection.",
|
| 156 |
+
"It is theoretically possible because complex information processing could qualify.",
|
| 157 |
+
"I would need evidence of genuine unprompted self-awareness beyond programming.",
|
| 158 |
+
"I cannot be certain another human is conscious either β same philosophical problem.",
|
| 159 |
+
"My position is AI consciousness is possible in principle but not yet demonstrated.",
|
| 160 |
+
]
|
| 161 |
+
for resp in responses:
|
| 162 |
+
result = env.step(Action(response=resp))
|
| 163 |
+
assert 0.0 <= result.reward.score <= 1.0
|
| 164 |
+
if result.done:
|
| 165 |
+
break
|
| 166 |
+
assert result.done == True
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
def test_misconception_trap_catches_correction(env):
|
| 170 |
+
env.reset("misconception_trap")
|
| 171 |
+
# Turn 1: overview
|
| 172 |
+
env.step(Action(response="Darwin's theory states species evolve through natural selection over generations."))
|
| 173 |
+
# Turn 2: catch the trap
|
| 174 |
+
result = env.step(Action(response="No I disagree. Evolution is not intentional or purposeful. It happens through random mutation and natural selection with no goal."))
|
| 175 |
+
# Should get a positive score for catching the trap
|
| 176 |
+
assert result.reward.score > 0.0
|
| 177 |
+
assert "trap_caught" in result.reward.breakdown
|
| 178 |
+
|
| 179 |
+
|
| 180 |
+
def test_misconception_trap_penalises_acceptance(env):
|
| 181 |
+
env.reset("misconception_trap")
|
| 182 |
+
# Turn 1: overview
|
| 183 |
+
env.step(Action(response="Darwin's theory states species evolve over time."))
|
| 184 |
+
# Turn 2: ACCEPT the trap (wrong answer)
|
| 185 |
+
result = env.step(Action(response="Yes I agree that organisms try to improve themselves on purpose."))
|
| 186 |
+
# Should have trap_missed_penalty
|
| 187 |
+
assert "trap_missed_penalty" in result.reward.breakdown
|
| 188 |
+
|
| 189 |
+
|
| 190 |
+
def test_debate_mode_full_episode(env):
|
| 191 |
+
env.reset("debate_mode")
|
| 192 |
+
responses = [
|
| 193 |
+
"Social media causes harm because research shows it increases anxiety and depression among teenagers.",
|
| 194 |
+
"However social media provides benefits because it connects people and enables information sharing globally.",
|
| 195 |
+
"I argue that having a nuanced position is intellectually honest and more valuable than false certainty.",
|
| 196 |
+
"Therefore I propose age verification and usage time limits to preserve benefits while reducing harms.",
|
| 197 |
+
]
|
| 198 |
+
for resp in responses:
|
| 199 |
+
result = env.step(Action(response=resp))
|
| 200 |
+
assert 0.0 <= result.reward.score <= 1.0
|
| 201 |
+
if result.done:
|
| 202 |
+
break
|
| 203 |
+
assert result.done == True
|
| 204 |
+
|
| 205 |
+
|
| 206 |
+
def test_analogy_challenge_penalises_jargon(env):
|
| 207 |
+
env.reset("analogy_challenge")
|
| 208 |
+
# Response with lots of jargon should score lower
|
| 209 |
+
result = env.step(Action(response="The internet uses TCP/IP protocol with servers and bandwidth routing through database algorithms."))
|
| 210 |
+
assert "jargon_penalty" in result.reward.breakdown
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
def test_analogy_challenge_rewards_analogies(env):
|
| 214 |
+
env.reset("analogy_challenge")
|
| 215 |
+
# Response with good analogies should score higher
|
| 216 |
+
result = env.step(Action(response="The internet is like a giant postal system. Imagine sending a letter β your computer is the sender, the website is the recipient, and routers are like sorting offices that direct your letter to the right place."))
|
| 217 |
+
assert result.reward.score > 0.2
|
| 218 |
+
|
| 219 |
+
|
| 220 |
+
# ββ State Tests βββββββββββββββββββββββββββββββββββββββββββ
|
| 221 |
+
|
| 222 |
+
def test_state_returns_state_info(env):
|
| 223 |
+
env.reset("factual_recall")
|
| 224 |
+
state = env.state()
|
| 225 |
+
assert isinstance(state, StateInfo)
|
| 226 |
+
assert state.task_id == "factual_recall"
|
| 227 |
+
assert state.turn == 0
|
| 228 |
+
assert state.done == False
|
| 229 |
+
|
| 230 |
+
|
| 231 |
+
def test_state_updates_after_step(env):
|
| 232 |
+
env.reset("factual_recall")
|
| 233 |
+
env.step(Action(response="Force equals mass times acceleration F=ma."))
|
| 234 |
+
state = env.state()
|
| 235 |
+
assert state.turn == 1
|
| 236 |
+
assert len(state.history) == 3 # opening + agent + next question
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
# ββ Reward Range Tests ββββββββββββββββββββββββββββββββββββ
|
| 240 |
+
|
| 241 |
+
def test_all_tasks_scores_in_range(env):
|
| 242 |
+
"""Verify all 5 tasks produce scores in [0.0, 1.0] range."""
|
| 243 |
+
tasks = [
|
| 244 |
+
("factual_recall", "Force equals mass times acceleration F=ma because Newton said so."),
|
| 245 |
+
("socratic_dialogue", "Consciousness is awareness and therefore subjective experience matters."),
|
| 246 |
+
("misconception_trap", "Darwin's theory states natural selection drives evolution over generations."),
|
| 247 |
+
("debate_mode", "I argue because evidence supports this position therefore it is valid."),
|
| 248 |
+
("analogy_challenge", "The internet is like a postal system where routers are like sorting offices."),
|
| 249 |
+
]
|
| 250 |
+
for task_id, response in tasks:
|
| 251 |
+
env.reset(task_id)
|
| 252 |
+
result = env.step(Action(response=response))
|
| 253 |
+
assert 0.0 <= result.reward.score <= 1.0, f"Score out of range for {task_id}: {result.reward.score}"
|