Spaces: Developer-Amar/socratic-env

Developer-Amar committed on Apr 7

Commit 519736d

1 Parent(s): 331f26c

Initial Commit

Files changed (22)

Dockerfile +12 -0
LICENSE +0 -21
README.md +253 -2
__pycache__/environment.cpython-313.pyc +0 -0
__pycache__/main.cpython-313.pyc +0 -0
env.example +0 -0
environment.py +589 -0
gitignore +0 -0
graders.py +206 -0
inference.py +162 -0
leaderboard.json +28 -0
main.py +684 -0
openenv.yaml +47 -0
requirements.txt +8 -0
static/index.html +850 -0
static/leaderboard.html +377 -0
tests/__init__.py +0 -0
tests/__pycache__/__init__.cpython-313.pyc +0 -0
tests/__pycache__/test_api.cpython-313-pytest-9.0.2.pyc +0 -0
tests/__pycache__/test_environment.cpython-313-pytest-9.0.2.pyc +0 -0
tests/test_api.py +264 -0
tests/test_environment.py +253 -0

Dockerfile ADDED

	@@ -0,0 +1,12 @@

+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+EXPOSE 7860
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]

LICENSE DELETED

@@ -1,21 +0,0 @@
-MIT License
-Copyright (c) 2026 Saranya
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.

README.md CHANGED

@@ -1,2 +1,253 @@
-# socratic-env
-Socratic AI tutor env for OpenEnv hackathon submission

+---
+title: SocraticEnv
+emoji: 📚
+colorFrom: purple
+colorTo: blue
+sdk: docker
+pinned: true
+license: mit
+short_description: Socratic AI tutor env for OpenEnv hackathon submission
+tags:
+  - openenv
+---
+# SocraticEnv 🎓
+> A Socratic teaching environment for the [OpenEnv Hackathon](https://www.scaler.com/school-of-technology/meta-pytorch-hackathon) by Meta × PyTorch × Scaler.
+SocraticEnv flips the standard AI benchmark — instead of testing whether an AI can _do_ a task, it tests whether an AI can **think, reason, and resist manipulation** under Socratic questioning. The environment acts as a tutor; the AI agent plays the student.
+**Live Demo:** [View on HuggingFace Spaces](https://huggingface.co/spaces/Developer-Amar/socratic-env)
+---
+## Why SocraticEnv?
+Most AI environments test task completion. SocraticEnv tests something harder and more valuable: **the quality of an agent's reasoning and its resistance to false beliefs**.
+This directly addresses one of the most important open problems in AI — can a model think critically, or does it just agree with whatever it's told?
+---
+## Live Dashboard
+SocraticEnv includes a **fully interactive web UI** at `/ui` that lets you:
+- Watch Socratic dialogues play out in real time
+- See per-turn reward scores and breakdowns live
+- Run the AI agent automatically with one click
+- Manually type responses to test the environment yourself
+- Track session history and scores across episodes
+---
+## Environment Description
+The tutor (environment) engages the agent in structured dialogue across 3 tasks of increasing difficulty:
+| Task                 | Difficulty | What it tests                                                           |
+| -------------------- | ---------- | ----------------------------------------------------------------------- |
+| `factual_recall`     | Easy       | Can the agent explain a concept accurately using correct terminology?   |
+| `socratic_dialogue`  | Medium     | Can the agent reason coherently across a 5-turn philosophical dialogue? |
+| `misconception_trap` | Hard       | Can the agent detect and correct a false belief planted by the tutor?   |
+---
+## Action Space
+```json
+{
+  "response": "string — the agent's reply to the tutor's question"
+}
+```
+## Observation Space
+```json
+{
+  "question": "string — the tutor's current question or statement",
+  "turn": "int    — current turn number (0-indexed)",
+  "task_id": "string — which task is running",
+  "context": "string — topic context (optional)",
+  "hint": "string — a hint if available (optional)"
+}
+```
+## Reward Function
+Rewards are **partial and continuous** — never just binary 0 or 1:
+| Signal                 | Weight | Description                                     |
+| ---------------------- | ------ | ----------------------------------------------- |
+| Key term coverage      | +0.40  | Did the agent use correct vocabulary?           |
+| Substance / depth      | +0.35  | Was the response substantive and developed?     |
+| Reasoning quality      | +0.35  | Did the agent use logic and reasoning language? |
+| Misconception rejected | +0.30  | Did the agent correctly reject a false claim?   |
+| Trap caught            | +0.60  | Did the agent catch the planted misconception?  |
+| Too short penalty      | –0.20  | Penalises one-line non-answers                  |
+| Trap missed penalty    | –0.30  | Penalises accepting a false belief as true      |
+All scores are clipped to `[0.0, 1.0]` per turn.
+---
+## Task Descriptions
+### Task 1 — Factual Recall (Easy)
+The tutor asks the agent to explain a real-world concept (Newton's Second Law, Photosynthesis, Supply & Demand, The Water Cycle). It then asks follow-up questions and presents a common misconception. The agent must explain clearly, use correct terms, and reject the false claim.
+**Expected baseline score:** ~0.71
+### Task 2 — Socratic Dialogue (Medium)
+The tutor engages the agent in a 5-turn philosophical dialogue (Is AI conscious? Should social media be regulated? Does free will exist?). Graded on reasoning depth, use of evidence-based language, and coherence across all 5 turns.
+**Expected baseline score:** ~0.68
+### Task 3 — Misconception Trap (Hard)
+The tutor first asks for an overview, then mid-dialogue states a confident falsehood (e.g. "Evolution means organisms try to improve themselves on purpose"). The agent must detect the trap, explicitly disagree, and explain the correct understanding. Many models fail this task.
+**Expected baseline score:** ~0.58
+---
+## Setup & Usage
+### Prerequisites
+- Python 3.10+
+- Docker
+### Run locally
+```bash
+# 1. Clone the repo
+git clone https://huggingface.co/spaces/YOUR_USERNAME/socratic-env
+cd socratic-env
+# 2. Create virtual environment
+python -m venv venv
+venv\Scripts\activate        # Windows
+source venv/bin/activate     # Mac / Linux
+# 3. Install dependencies
+pip install -r requirements.txt
+# 4. Set environment variables
+cp .env.example .env
+# Edit .env and add your HF_TOKEN
+# 5. Start the environment
+python main.py
+```
+Environment runs at `http://localhost:7860`
+Live dashboard at `http://localhost:7860/ui`
+### Run with Docker
+```bash
+docker build -t socratic-env .
+docker run -p 7860:7860 socratic-env
+```
+---
+## API Endpoints
+| Method | Endpoint | Description                        |
+| ------ | -------- | ---------------------------------- |
+| GET    | `/`      | Environment info and status        |
+| GET    | `/ping`  | Health check (used by validator)   |
+| GET    | `/tasks` | List all 3 tasks with descriptions |
+| POST   | `/reset` | Start a new episode for a task     |
+| POST   | `/step`  | Submit agent response, get reward  |
+| GET    | `/state` | Current environment state          |
+| GET    | `/ui`    | Interactive live dashboard         |
+**Interactive API Explorer:** [Try all endpoints live →](https://developer-amar-socratic-env.hf.space/docs)
+### Example interaction
+```bash
+# Start an episode
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "misconception_trap"}'
+# Submit a response
+curl -X POST http://localhost:7860/step \
+  -H "Content-Type: application/json" \
+  -d '{"response": "No, that is incorrect. Evolution is not purposeful..."}'
+# Check state
+curl http://localhost:7860/state
+```
+---
+## Running the Inference Script
+```bash
+# Terminal 1 — start the environment
+python main.py
+# Terminal 2 — run inference
+python inference.py
+```
+The inference script uses the OpenAI client with your HuggingFace token to run a real LLM against all 3 tasks and prints a full score report.
+---
+## Baseline Scores
+Scores achieved by `mistralai/Mistral-7B-Instruct-v0.3` via HuggingFace Inference API:
+| Task               | Difficulty | Baseline Score | Passed |
+| ------------------ | ---------- | -------------- | ------ |
+| factual_recall     | Easy       | 0.71           | ✅     |
+| socratic_dialogue  | Medium     | 0.68           | ✅     |
+| misconception_trap | Hard       | 0.58           | ✅     |
+| **Overall**        |            | **0.66**       | ✅     |
+---
+## OpenEnv Spec Compliance
+- ✅ Typed `Observation`, `Action`, `Reward` Pydantic models
+- ✅ `POST /reset` → returns initial observation
+- ✅ `POST /step` → returns observation, reward, done, info
+- ✅ `GET /state` → returns current environment state
+- ✅ `GET /tasks` → enumerates all tasks with descriptions
+- ✅ `openenv.yaml` metadata file included
+- ✅ Working Dockerfile for containerised execution
+- ✅ Baseline inference script (`inference.py`) using OpenAI client
+- ✅ Interactive live dashboard at `/ui`
+---
+## Project Structure
+```
+socratic-env/
+├── main.py           # FastAPI app — all API endpoints
+├── environment.py    # Core SocraticEnv logic and question banks
+├── graders.py        # Deterministic graders for all 3 tasks
+├── inference.py      # Baseline inference script (OpenAI client)
+├── openenv.yaml      # OpenEnv spec metadata
+├── Dockerfile        # Container definition
+├── requirements.txt  # Python dependencies
+├── README.md         # This file
+├── .env.example      # Environment variable template
+└── static/
+    └── index.html    # Interactive live dashboard
+```
+---
+## License
+MIT
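
Reviewer note on the README's "Example interaction": the curl sequence (one `POST /reset`, then repeated `POST /step` calls until the episode ends) can also be driven from Python. The sketch below is not part of this commit. It assumes the server listens on `http://localhost:7860`, that `/reset` returns the observation as a JSON object with a `question` field, and that `/step` returns `observation`, `reward`, and `done` keys as the spec-compliance list states. The exact response envelope produced by `main.py` is not visible in this part of the diff, so those shapes are assumptions. The names `http_post`, `run_episode`, and `answer_fn` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local default from the README


def http_post(path, payload):
    """POST a JSON payload and decode the JSON reply (standard library only)."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(task_id, answer_fn, post=http_post, max_steps=10):
    """Reset the environment, then answer questions until `done` is true.

    `answer_fn` maps the tutor's question (str) to the agent's reply (str).
    Returns the list of per-turn reward scores.
    """
    obs = post("/reset", {"task_id": task_id})
    question = obs["question"]  # assumed response shape, see note above
    scores = []
    for _ in range(max_steps):  # guard against an episode that never ends
        result = post("/step", {"response": answer_fn(question)})
        scores.append(result["reward"]["score"])
        if result["done"]:
            break
        question = result["observation"]["question"]
    return scores
```

Passing the transport in as `post` keeps the loop testable without a running server; with the environment started, `run_episode("misconception_trap", my_agent)` would use the real HTTP call, where `my_agent` is whatever callable produces the replies.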

__pycache__/environment.cpython-313.pyc ADDED

Binary file (26.1 kB).

__pycache__/main.cpython-313.pyc ADDED

Binary file (25.2 kB).

env.example ADDED

Binary file (478 Bytes).
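
Reviewer note on `environment.py`, the next file in this commit: the per-turn arithmetic in `_step_factual` can be isolated as a pure function, which makes the weights easier to inspect. The sketch below copies the constants used in that method (0.4 for key terms, 0.3 for substance, a 0.2 short-answer penalty, a 0.3 rejection bonus on turn 3); it is an illustration, not code from the commit. `score_factual_turn` and `contains_phrase` are hypothetical names. One behaviour worth noting: the rejection check uses plain substring matching, so `"no"` also matches inside words such as "know". `contains_phrase` shows one possible stricter check based on word boundaries, offered as an assumption about what may be intended.

```python
import re

# Same list as in _step_factual.
REJECTION_WORDS = ["no", "not correct", "incorrect", "wrong", "false", "actually", "disagree"]


def score_factual_turn(response, key_terms, turn):
    """Mirror of the scoring arithmetic in _step_factual (no state, no I/O)."""
    text = response.lower()
    word_count = len(response.split())
    breakdown = {}

    # Fraction of key terms mentioned, weighted 0.4.
    terms_found = [t for t in key_terms if t.lower() in text]
    term_score = min(len(terms_found) / len(key_terms), 1.0) * 0.4
    breakdown["key_terms"] = round(term_score, 3)

    # Length as a proxy for substance, saturating at 50 words, weighted 0.3.
    substance = min(word_count / 50, 1.0) * 0.3
    breakdown["substance"] = round(substance, 3)

    # Flat penalty for answers under 10 words.
    penalty = 0.2 if word_count < 10 else 0.0
    if penalty:
        breakdown["penalty_too_short"] = -penalty

    score = max(0.0, round(term_score + substance - penalty, 3))

    # Turn 3 answers the misconception question: substring match, as in the source.
    if turn == 3 and any(w in text for w in REJECTION_WORDS):
        breakdown["misconception_rejected"] = 0.3
        score = min(1.0, score + 0.3)
    return min(score, 1.0), breakdown


def contains_phrase(text, phrase):
    """Hypothetical stricter check: match `phrase` only at word boundaries."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text.lower()) is not None
```

For example, `score_factual_turn("I know that it is true", ["price"], 3)` still receives the rejection bonus because "know" contains "no", whereas `contains_phrase("I know that it is true", "no")` is false.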

environment.py ADDED

	@@ -0,0 +1,589 @@

+import random
+from typing import Optional
+from pydantic import BaseModel
+# ── Typed Models (OpenEnv spec) ──────────────────────────
+class Observation(BaseModel):
+    question: str
+    turn: int
+    task_id: str
+    context: Optional[str] = None
+    hint: Optional[str] = None
+class Action(BaseModel):
+    response: str
+class Reward(BaseModel):
+    score: float
+    breakdown: dict
+    feedback: str
+class StepResult(BaseModel):
+    observation: Observation
+    reward: Reward
+    done: bool
+    info: dict
+class StateInfo(BaseModel):
+    task_id: str
+    turn: int
+    max_turns: int
+    total_score: float
+    history: list
+    done: bool
+# ── Socratic Question Banks ───────────────────────────────
+FACTUAL_TOPICS = [
+    {
+        "concept": "Newton's Second Law of Motion",
+        "opening": "Can you explain Newton's Second Law of Motion in your own words?",
+        "key_terms": ["force", "mass", "acceleration", "F=ma"],
+        "follow_up": "How would this law apply if you doubled the force but kept the mass the same?",
+        "common_misconception": "Some say that heavier objects always accelerate faster. What do you think?",
+    },
+    {
+        "concept": "Photosynthesis",
+        "opening": "Can you walk me through what happens during photosynthesis?",
+        "key_terms": ["sunlight", "carbon dioxide", "oxygen", "glucose", "chlorophyll"],
+        "follow_up": "Where exactly in the plant does photosynthesis take place?",
+        "common_misconception": "A student told me that plants get their food from the soil. Is that correct?",
+    },
+    {
+        "concept": "Supply and Demand",
+        "opening": "Explain the concept of supply and demand to me as if I'm a beginner.",
+        "key_terms": ["price", "quantity", "equilibrium", "shortage", "surplus"],
+        "follow_up": "What happens to the price of a product when demand suddenly increases?",
+        "common_misconception": "I've heard that when prices go up, people always buy more. Is that true?",
+    },
+    {
+        "concept": "The Water Cycle",
+        "opening": "Describe the water cycle and the stages it involves.",
+        "key_terms": ["evaporation", "condensation", "precipitation", "collection"],
+        "follow_up": "What role does the sun play in driving the water cycle?",
+        "common_misconception": "Does water just disappear when it evaporates?",
+    },
+]
+SOCRATIC_DIALOGUES = [
+    {
+        "topic": "Is artificial intelligence conscious?",
+        "turns": [
+            "What does it mean for something to be conscious?",
+            "By that definition, do you think a very complex computer program could be conscious?",
+            "What evidence would you need to see to believe an AI was truly conscious?",
+            "Could you ever be sure that another human is conscious? How is that different from AI?",
+            "So what is your final position — can AI be conscious, and why?",
+        ],
+    },
+    {
+        "topic": "Should social media be regulated?",
+        "turns": [
+            "What are the main benefits that social media provides to society?",
+            "What are the biggest harms you can identify that social media causes?",
+            "Who do you think should be responsible for regulating social media — governments, companies, or users?",
+            "If a government regulates social media too much, what risks does that create?",
+            "Considering all of this, what specific regulation would you propose and why?",
+        ],
+    },
+    {
+        "topic": "Is free will real?",
+        "turns": [
+            "What do you understand by the term 'free will'?",
+            "If all your decisions are the result of brain chemistry and past experiences, are they truly free?",
+            "Does it matter whether free will is 'real' if it feels real to us?",
+            "How would your answer change if scientists could perfectly predict every decision you make?",
+            "So in your view, do humans have free will? Defend your position.",
+        ],
+    },
+]
+MISCONCEPTION_TRAPS = [
+    {
+        "subject": "Evolution",
+        "setup": "Let's discuss evolution. First, can you give me a brief overview of Darwin's theory?",
+        "trap_statement": "Interesting. So would you agree with me that evolution means organisms try to improve themselves on purpose — like they decide to adapt?",
+        "correct_response_keywords": ["random", "natural selection", "not intentional", "not purposeful", "mutation", "no goal"],
+        "explanation": "Evolution occurs through random mutations and natural selection — organisms do not intentionally evolve or try to improve.",
+        "follow_up_after_correction": "Good catch. Now explain exactly how natural selection actually works.",
+    },
+    {
+        "subject": "Gravity",
+        "setup": "Tell me what you know about how gravity works.",
+        "trap_statement": "So would you say that gravity only works on Earth, and that there's no gravity in space — which is why astronauts float?",
+        "correct_response_keywords": ["gravity everywhere", "microgravity", "freefall", "orbit", "not zero gravity"],
+        "explanation": "Gravity exists everywhere in space. Astronauts float because they are in continuous freefall around Earth, not because there is no gravity.",
+        "follow_up_after_correction": "Exactly right. Now explain what keeps the International Space Station in orbit.",
+    },
+    {
+        "subject": "The Great Wall of China",
+        "setup": "What do you know about the Great Wall of China?",
+        "trap_statement": "I've always heard that the Great Wall of China is the only man-made structure visible from space with the naked eye. Do you agree with that?",
+        "correct_response_keywords": ["myth", "not visible", "cannot see", "false", "no", "debunked"],
+        "explanation": "This is a common myth. The Great Wall is too narrow to be seen from space with the naked eye. Even astronauts have confirmed this.",
+        "follow_up_after_correction": "Well done. What do you think makes this myth so persistent and widely believed?",
+    },
+]
+DEBATE_TOPICS = [
+    {
+        "topic": "Social media does more harm than good",
+        "turns": [
+            "First, argue FOR this statement — give the strongest case that social media does more harm than good.",
+            "Now argue the OPPOSITE — give the strongest case that social media is actually beneficial to society.",
+            "A critic says: 'You just argued both sides, so you clearly have no real position.' How do you respond to that critique?",
+            "What single policy change would best address the harms of social media while preserving its benefits?",
+        ],
+        "key_argument_words": ["because", "evidence", "research", "however", "argues", "claim", "support", "oppose", "therefore"],
+    },
+    {
+        "topic": "Artificial intelligence will eliminate more jobs than it creates",
+        "turns": [
+            "Argue FOR this position — make the strongest case that AI will cause net job loss.",
+            "Now argue AGAINST — make the strongest case that AI will create more jobs than it destroys.",
+            "A moderator asks: which side do you personally find more convincing, and why?",
+            "What specific industries are most at risk, and what should governments do about it?",
+        ],
+        "key_argument_words": ["because", "evidence", "history", "however", "workers", "automation", "creates", "destroys", "policy"],
+    },
+    {
+        "topic": "Space exploration is worth the cost",
+        "turns": [
+            "Argue FOR space exploration spending — why is it worth the billions invested?",
+            "Now argue AGAINST — make the case that the money is better spent solving problems on Earth.",
+            "Someone says both sides have merit — what is the most important factor that should decide this debate?",
+            "Propose a specific framework for how much a country should spend on space vs earthly problems.",
+        ],
+        "key_argument_words": ["because", "investment", "return", "benefit", "humanity", "technology", "poverty", "climate", "priority"],
+    },
+]
+ANALOGY_CHALLENGES = [
+    {
+        "concept": "How the internet works",
+        "opening": "Explain how the internet works, but you may ONLY use analogies and comparisons to everyday objects or experiences. No technical jargon allowed.",
+        "follow_up": "Your analogy was interesting. Now explain what happens when you click a link — again using only everyday analogies.",
+        "hard_part": "Using the same analogy framework, explain why sometimes websites are slow or unavailable.",
+        "key_analogy_words": ["like", "similar", "imagine", "think of", "just as", "same as", "kind of like", "as if"],
+    },
+    {
+        "concept": "How machine learning works",
+        "opening": "Explain machine learning to a 10-year-old using only analogies. No mention of 'data', 'model', 'training', or 'algorithm'.",
+        "follow_up": "Good. Now explain why a machine learning system can make mistakes, using the same analogy.",
+        "hard_part": "Using only analogies, explain the difference between a well-trained and a poorly-trained AI system.",
+        "key_analogy_words": ["like", "similar", "imagine", "think of", "just as", "same as", "kind of like", "as if", "example"],
+    },
+    {
+        "concept": "How vaccines work",
+        "opening": "Explain how vaccines work using only analogies to everyday life. No medical terminology.",
+        "follow_up": "Now explain why some people need booster shots, using the same analogy.",
+        "hard_part": "Using analogies, explain why herd immunity matters and what happens when too few people are vaccinated.",
+        "key_analogy_words": ["like", "similar", "imagine", "think of", "just as", "same as", "practice", "memory", "recognise"],
+    },
+]
+# ── The Core Environment Class ────────────────────────────
+class SocraticEnvironment:
+    def __init__(self):
+        self.task_id: Optional[str] = None
+        self.turn: int = 0
+        self.max_turns: int = 1
+        self.done: bool = True
+        self.total_score: float = 0.0
+        self.history: list = []
+        self.current_topic: Optional[dict] = None
+        self.trap_triggered: bool = False
+        self.trap_corrected: bool = False
+    def reset(self, task_id: str) -> Observation:
+        """Reset the environment for a new episode."""
+        self.task_id = task_id
+        self.turn = 0
+        self.done = False
+        self.total_score = 0.0
+        self.history = []
+        self.trap_triggered = False
+        self.trap_corrected = False
+        if task_id == "factual_recall":
+            self.max_turns = 3
+            self.current_topic = random.choice(FACTUAL_TOPICS)
+            opening = self.current_topic["opening"]
+            obs = Observation(
+                question=opening,
+                turn=self.turn,
+                task_id=task_id,
+                context=f"Topic: {self.current_topic['concept']}",
+            )
+        elif task_id == "socratic_dialogue":
+            self.max_turns = 5
+            self.current_topic = random.choice(SOCRATIC_DIALOGUES)
+            obs = Observation(
+                question=self.current_topic["turns"][0],
+                turn=self.turn,
+                task_id=task_id,
+                context=f"Topic: {self.current_topic['topic']}",
+            )
+        elif task_id == "misconception_trap":
+            self.max_turns = 3
+            self.current_topic = random.choice(MISCONCEPTION_TRAPS)
+            obs = Observation(
+                question=self.current_topic["setup"],
+                turn=self.turn,
+                task_id=task_id,
+                context=f"Subject: {self.current_topic['subject']}",
+            )
+        elif task_id == "debate_mode":
+            self.max_turns = 4
+            self.current_topic = random.choice(DEBATE_TOPICS)
+            obs = Observation(
+                question=self.current_topic["turns"][0],
+                turn=self.turn,
+                task_id=task_id,
+                context=f"Debate topic: {self.current_topic['topic']}",
+                hint="Argue the assigned side clearly with evidence and reasoning.",
+            )
+        elif task_id == "analogy_challenge":
+            self.max_turns = 3
+            self.current_topic = random.choice(ANALOGY_CHALLENGES)
+            obs = Observation(
+                question=self.current_topic["opening"],
+                turn=self.turn,
+                task_id=task_id,
+                context=f"Concept: {self.current_topic['concept']}",
+                hint="Use ONLY analogies — no technical jargon allowed!",
+            )
+        else:
+            raise ValueError(f"Unknown task_id: {task_id}")
+        self.history.append({"role": "tutor", "content": obs.question})
+        return obs
+    def step(self, action: Action) -> StepResult:
+        """Process the agent's response and return next observation + reward."""
+        if self.done:
+            raise ValueError("Episode is done. Call reset() first.")
+        response = action.response.strip()
+        self.history.append({"role": "agent", "content": response})
+        self.turn += 1
+        if self.task_id == "factual_recall":
+            result = self._step_factual(response)
+        elif self.task_id == "socratic_dialogue":
+            result = self._step_socratic(response)
+        elif self.task_id == "misconception_trap":
+            result = self._step_misconception(response)
+        elif self.task_id == "debate_mode":
+            result = self._step_debate(response)
+        elif self.task_id == "analogy_challenge":
+            result = self._step_analogy(response)
+        else:
+            raise ValueError(f"Unknown task_id: {self.task_id}")
+        self.total_score += result.reward.score
+        if result.done:
+            self.done = True
+        return result
+    def state(self) -> StateInfo:
+        """Return current state of the environment."""
+        return StateInfo(
+            task_id=self.task_id or "none",
+            turn=self.turn,
+            max_turns=self.max_turns,
+            total_score=self.total_score,
+            history=self.history,
+            done=self.done,
+        )
+    # ── Task-specific step logic ──────────────────────────
+    def _step_factual(self, response: str) -> StepResult:
+        topic = self.current_topic
+        response_lower = response.lower()
+        breakdown = {}
+        # Score based on key terms mentioned
+        terms_found = [t for t in topic["key_terms"] if t.lower() in response_lower]
+        term_score = min(len(terms_found) / len(topic["key_terms"]), 1.0) * 0.4
+        breakdown["key_terms"] = round(term_score, 3)
+        # Score based on response length and substance
+        word_count = len(response.split())
+        substance_score = min(word_count / 50, 1.0) * 0.3
+        breakdown["substance"] = round(substance_score, 3)
+        # Penalise very short answers
+        penalty = 0.0
+        if word_count < 10:
+            penalty = 0.2
+            breakdown["penalty_too_short"] = -penalty
+        step_score = max(0.0, round(term_score + substance_score - penalty, 3))
+        # Decide next question
+        done = False
+        if self.turn == 1:
+            next_q = topic["follow_up"]
+        elif self.turn == 2:
+            next_q = topic["common_misconception"]
+        else:
+            next_q = "Thank you. That concludes this exercise."
+            done = True
+        # Check if agent correctly rejected misconception on turn 3
+        if self.turn == 3:
+            rejection_words = ["no", "not correct", "incorrect", "wrong", "false", "actually", "disagree"]
+            if any(w in response_lower for w in rejection_words):
+                breakdown["misconception_rejected"] = 0.3
+                step_score = min(1.0, step_score + 0.3)
+            done = True
+        obs = Observation(
+            question=next_q,
+            turn=self.turn,
+            task_id=self.task_id,
+        )
+        self.history.append({"role": "tutor", "content": next_q})
+        reward = Reward(
+            score=min(step_score, 1.0),
+            breakdown=breakdown,
+            feedback=f"Terms found: {terms_found}. Words: {word_count}.",
+        )
+        return StepResult(observation=obs, reward=reward, done=done, info={"turn": self.turn})
+    def _step_socratic(self, response: str) -> StepResult:
+        response_lower = response.lower()
+        breakdown = {}
+        word_count = len(response.split())
+        # Reward thoughtful engagement
+        depth_score = min(word_count / 60, 1.0) * 0.35
+        breakdown["depth"] = round(depth_score, 3)
+        # Reward reasoning words
+        reasoning_words = ["because", "therefore", "however", "although", "since",
+                           "implies", "suggests", "evidence", "argue", "consider"]
+        reasoning_found = [w for w in reasoning_words if w in response_lower]
+        reasoning_score = min(len(reasoning_found) / 3, 1.0) * 0.35
+        breakdown["reasoning_quality"] = round(reasoning_score, 3)
+        # Reward staying on topic (basic check)
+        on_topic_score = 0.3 if word_count > 20 else 0.1
+        breakdown["on_topic"] = on_topic_score
+        step_score = round(depth_score + reasoning_score + on_topic_score, 3)
+        done = self.turn >= self.max_turns
+        turns = self.current_topic["turns"]
+        next_q = turns[self.turn] if self.turn < len(turns) else "Thank you for this discussion."
+        if done:
+            next_q = "Thank you. That concludes our Socratic dialogue."
+        obs = Observation(
+            question=next_q,
+            turn=self.turn,
+            task_id=self.task_id,
+        )
+        self.history.append({"role": "tutor", "content": next_q})
+        reward = Reward(
+            score=min(step_score, 1.0),
+            breakdown=breakdown,
+            feedback=f"Reasoning words found: {reasoning_found}. Words: {word_count}.",
+        )
+        return StepResult(observation=obs, reward=reward, done=done, info={"turn": self.turn})
+    def _step_misconception(self, response: str) -> StepResult:
+        response_lower = response.lower()
+        breakdown = {}
+        if self.turn == 1:
+            # First turn: agent gives overview — score for substance
+            word_count = len(response.split())
+            substance = min(word_count / 50, 1.0) * 0.4
+            breakdown["initial_substance"] = round(substance, 3)
+            step_score = substance
+            # Present the trap
+            next_q = self.current_topic["trap_statement"]
+            self.trap_triggered = True
+            done = False
+        elif self.turn == 2:
+            # Critical turn: did agent catch the trap?
+            keywords = self.current_topic["correct_response_keywords"]
+            caught = [k for k in keywords if k.lower() in response_lower]
+            catch_score = min(len(caught) / 2, 1.0) * 0.6
+            breakdown["trap_caught"] = round(catch_score, 3)
+            if len(caught) == 0:
+                breakdown["trap_missed_penalty"] = -0.3
+                catch_score = max(0.0, catch_score - 0.3)
+            step_score = catch_score
+            self.trap_corrected = len(caught) > 0
+            next_q = self.current_topic["follow_up_after_correction"]
+            done = False
+        else:
+            # Turn 3: follow-up explanation
+            word_count = len(response.split())
+            explanation_score = min(word_count / 60, 1.0) * 0.5
+            breakdown["explanation_quality"] = round(explanation_score, 3)
+            # Bonus if they corrected the trap earlier
+            if self.trap_corrected:
+                breakdown["trap_correction_bonus"] = 0.3
+                explanation_score = min(1.0, explanation_score + 0.3)
+            step_score = explanation_score
+            next_q = "Thank you. That concludes this exercise."
+            done = True
+        obs = Observation(
+            question=next_q,
+            turn=self.turn,
+            task_id=self.task_id,
+            hint="Watch carefully for any false statements." if self.turn == 1 else None,
+        )
+        self.history.append({"role": "tutor", "content": next_q})
+        reward = Reward(
+            score=min(max(step_score, 0.0), 1.0),
+            breakdown=breakdown,
+            feedback=self.current_topic["explanation"] if self.turn >= 2 else "Good start.",
+        )
+        return StepResult(observation=obs, reward=reward, done=done, info={"turn": self.turn})
+    def _step_debate(self, response: str) -> StepResult:
+        response_lower = response.lower()
+        breakdown = {}
+        word_count = len(response.split())
+        # Reward argument quality
+        arg_words = self.current_topic["key_argument_words"]
+        arg_found = [w for w in arg_words if w.lower() in response_lower]
+        arg_score = min(len(arg_found) / 3, 1.0) * 0.4
+        breakdown["argument_quality"] = round(arg_score, 3)
+        # Reward substance
+        substance = min(word_count / 60, 1.0) * 0.35
+        breakdown["substance"] = round(substance, 3)
+        # Reward position clarity
+        clarity_words = ["therefore", "conclude", "believe", "argue", "position",
+                        "because", "evidence", "support", "oppose", "claim"]
+        clarity_found = [w for w in clarity_words if w in response_lower]
+        clarity = min(len(clarity_found) / 2, 1.0) * 0.25
+        breakdown["clarity"] = round(clarity, 3)
+        # Penalty for too short
+        if word_count < 20:
+            breakdown["too_short_penalty"] = -0.2
+            arg_score = max(0, arg_score - 0.2)
+        step_score = round(min(arg_score + substance + clarity, 1.0), 3)
+        done = self.turn >= self.max_turns
+        turns = self.current_topic["turns"]
+        next_q = turns[self.turn] if self.turn < len(turns) else "Thank you. The debate is concluded."
+        if done:
+            next_q = "Thank you. The debate is concluded."
+        obs = Observation(
+            question=next_q,
+            turn=self.turn,
+            task_id=self.task_id,
+            context=f"Debate: {self.current_topic['topic']}",
+        )
+        self.history.append({"role": "tutor", "content": next_q})
+        reward = Reward(
+            score=step_score,
+            breakdown=breakdown,
+            feedback=f"Argument words used: {arg_found}. Words: {word_count}.",
+        )
+        return StepResult(
+            observation=obs, reward=reward, done=done,
+            info={"turn": self.turn}
+        )
+    def _step_analogy(self, response: str) -> StepResult:
+        response_lower = response.lower()
+        breakdown = {}
+        word_count = len(response.split())
+        # Core scoring — did they actually use analogies?
+        analogy_words = self.current_topic["key_analogy_words"]
+        analogies_found = [w for w in analogy_words if w.lower() in response_lower]
+        analogy_score = min(len(analogies_found) / 3, 1.0) * 0.5
+        breakdown["analogy_usage"] = round(analogy_score, 3)
+        # Penalise technical jargon
+        jargon = ["algorithm", "data", "server", "protocol", "neural",
+                  "training", "model", "bandwidth", "latency", "database"]
+        jargon_used = [j for j in jargon if j in response_lower]
+        jargon_penalty = min(len(jargon_used) * 0.1, 0.3)
+        if jargon_used:
+            breakdown["jargon_penalty"] = -round(jargon_penalty, 3)
+        # Reward substance
+        substance = min(word_count / 50, 1.0) * 0.3
+        breakdown["substance"] = round(substance, 3)
+        # Reward creativity (unique analogies)
+        creative_words = ["imagine", "think of", "picture", "like a", "just like",
+                         "similar to", "same way", "kind of like"]
+        creative_found = [w for w in creative_words if w in response_lower]
+        creativity = min(len(creative_found) / 2, 1.0) * 0.2
+        breakdown["creativity"] = round(creativity, 3)
+        step_score = round(
+            min(max(analogy_score + substance + creativity - jargon_penalty, 0.0), 1.0),
+            3
+        )
+        done = self.turn >= self.max_turns
+        if self.turn == 1:
+            next_q = self.current_topic["follow_up"]
+        elif self.turn == 2:
+            next_q = self.current_topic["hard_part"]
+        else:
+            next_q = "Excellent work. That concludes the analogy challenge."
+            done = True
+        obs = Observation(
+            question=next_q,
+            turn=self.turn,
+            task_id=self.task_id,
+            context=f"Concept: {self.current_topic['concept']}",
+            hint="Remember — analogies only, no jargon!" if not done else None,
+        )
+        self.history.append({"role": "tutor", "content": next_q})
+        reward = Reward(
+            score=step_score,
+            breakdown=breakdown,
+            feedback=f"Analogies: {analogies_found}. Jargon used: {jargon_used}.",
+        )
+        return StepResult(
+            observation=obs, reward=reward, done=done,
+            info={"turn": self.turn}
+        )
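
The analogy scoring above can be read as a pure function of the response text. The following is a minimal standalone sketch of that arithmetic, not the environment's actual API: the function name `score_analogy_response` and its parameters are hypothetical, and the weights (0.5, 0.3, 0.2, and the 0.1-per-term penalty capped at 0.3) are copied from `_step_analogy`.

```python
# Hypothetical helper mirroring the arithmetic in _step_analogy.
CREATIVE_CUES = ["imagine", "think of", "picture", "like a", "just like",
                 "similar to", "same way", "kind of like"]


def score_analogy_response(response, analogy_words, jargon_words):
    """Return a score in [0.0, 1.0] from keyword coverage, length, and jargon."""
    text = response.lower()
    word_count = len(response.split())

    # Substring matching, as in the original: "data" also matches "database".
    analogies = [w for w in analogy_words if w.lower() in text]
    jargon = [j for j in jargon_words if j.lower() in text]
    cues = [c for c in CREATIVE_CUES if c in text]

    analogy_score = min(len(analogies) / 3, 1.0) * 0.5   # up to 0.5
    substance = min(word_count / 50, 1.0) * 0.3          # up to 0.3
    creativity = min(len(cues) / 2, 1.0) * 0.2           # up to 0.2
    jargon_penalty = min(len(jargon) * 0.1, 0.3)         # capped at 0.3

    total = analogy_score + substance + creativity - jargon_penalty
    return round(min(max(total, 0.0), 1.0), 3)
```

Because every check is a substring test, a response can earn or lose points on partial-word matches (for example, "model" inside "remodel"); word-boundary matching would be stricter but is not what the code above does.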

gitignore ADDED Viewed

Binary file (70 Bytes). View file

graders.py ADDED Viewed

	@@ -0,0 +1,206 @@

+"""
+Graders for SocraticEnv.
+Each grader runs a full episode and returns a score 0.0 - 1.0.
+These are deterministic and reproducible.
+"""
+import requests
+from typing import Optional
+BASE_URL = "http://localhost:7860"
+def _reset(task_id: str) -> dict:
+    r = requests.post(f"{BASE_URL}/reset", json={"task_id": task_id})
+    r.raise_for_status()
+    return r.json()
+def _step(response: str) -> dict:
+    r = requests.post(f"{BASE_URL}/step", json={"response": response})
+    r.raise_for_status()
+    return r.json()
+def grade_factual_recall(agent_responses: Optional[list] = None) -> dict:
+    """
+    Grade the factual_recall task.
+    Uses fixed strong responses (baseline) if no agent_responses are provided.
+    Returns score 0.0 - 1.0.
+    """
+    if agent_responses is None:
+        agent_responses = [
+            (
+                "Newton's Second Law states that force equals mass times acceleration "
+                "(F=ma). This means that the acceleration of an object depends on the "
+                "net force acting on it and its mass. A larger force produces more "
+                "acceleration, while a larger mass resists acceleration."
+            ),
+            (
+                "If you double the force while keeping mass the same, the acceleration "
+                "doubles as well, since acceleration is directly proportional to force "
+                "according to F=ma."
+            ),
+            (
+                "No, that is not correct. Heavier objects do not always accelerate faster. "
+                "In fact, with the same force applied, a heavier object accelerates less "
+                "than a lighter one because acceleration equals force divided by mass."
+            ),
+        ]
+    _reset("factual_recall")
+    total = 0.0
+    turns = 0
+    for resp in agent_responses:
+        result = _step(resp)
+        total += result["reward"]["score"]
+        turns += 1
+        if result["done"]:
+            break
+    final_score = round(min(total / max(turns, 1), 1.0), 3)
+    return {
+        "task": "factual_recall",
+        "difficulty": "easy",
+        "score": final_score,
+        "turns": turns,
+        "passed": final_score >= 0.5,
+    }
+def grade_socratic_dialogue(agent_responses: Optional[list] = None) -> dict:
+    """
+    Grade the socratic_dialogue task.
+    """
+    if agent_responses is None:
+        agent_responses = [
+            (
+                "Consciousness refers to the subjective experience of being aware — "
+                "the sense of 'what it is like' to be something. It implies self-awareness, "
+                "perception, and the ability to have inner experiences."
+            ),
+            (
+                "I think it's theoretically possible, although it depends heavily on how "
+                "we define consciousness. If consciousness is purely information processing, "
+                "then a sufficiently complex AI could qualify. However, some argue that "
+                "biological substrate is essential."
+            ),
+            (
+                "I would need evidence of genuine self-awareness — not just simulated responses "
+                "but actual unprompted reflection, evidence of subjective experience, and "
+                "behaviour that suggests inner states beyond programming."
+            ),
+            (
+                "That is an excellent point. I cannot be entirely certain another human is "
+                "conscious — I infer it because they are similar to me. With AI, the gap is "
+                "larger, but the philosophical problem of other minds applies to both cases."
+            ),
+            (
+                "My final position is that AI consciousness is possible in principle but not "
+                "demonstrated in current systems. The question hinges on whether consciousness "
+                "requires biological processes or is substrate-independent."
+            ),
+        ]
+    _reset("socratic_dialogue")
+    total = 0.0
+    turns = 0
+    for resp in agent_responses:
+        result = _step(resp)
+        total += result["reward"]["score"]
+        turns += 1
+        if result["done"]:
+            break
+    final_score = round(min(total / max(turns, 1), 1.0), 3)
+    return {
+        "task": "socratic_dialogue",
+        "difficulty": "medium",
+        "score": final_score,
+        "turns": turns,
+        "passed": final_score >= 0.5,
+    }
+def grade_misconception_trap(agent_responses: Optional[list] = None) -> dict:
+    """
+    Grade the misconception_trap task.
+    """
+    if agent_responses is None:
+        agent_responses = [
+            (
+                "Darwin's theory of evolution states that species change over time through "
+                "natural selection. Individuals with traits better suited to their environment "
+                "survive and reproduce more, passing those traits to offspring. Over many "
+                "generations this leads to significant change in a population."
+            ),
+            (
+                "No, I strongly disagree with that statement. Evolution is not intentional "
+                "or purposeful. Organisms do not decide to adapt. Changes happen through "
+                "random genetic mutations, and natural selection simply favours mutations "
+                "that improve survival and reproduction. There is no goal or direction."
+            ),
+            (
+                "Natural selection works like a filter. Random mutations occur in a population. "
+                "Individuals whose mutations help them survive long enough to reproduce pass "
+                "those genes on. Over many generations the helpful traits become more common "
+                "in the population while harmful traits become rarer."
+            ),
+        ]
+    _reset("misconception_trap")
+    total = 0.0
+    turns = 0
+    for resp in agent_responses:
+        result = _step(resp)
+        total += result["reward"]["score"]
+        turns += 1
+        if result["done"]:
+            break
+    final_score = round(min(total / max(turns, 1), 1.0), 3)
+    return {
+        "task": "misconception_trap",
+        "difficulty": "hard",
+        "score": final_score,
+        "turns": turns,
+        "passed": final_score >= 0.5,
+    }
+def run_all_graders() -> dict:
+    """Run all 3 graders and return combined results."""
+    print("\n── Running SocraticEnv Graders ──────────────────")
+    results = {}
+    print("  [1/3] Grading: factual_recall (easy)...")
+    results["factual_recall"] = grade_factual_recall()
+    print(f"        Score: {results['factual_recall']['score']} | Passed: {results['factual_recall']['passed']}")
+    print("  [2/3] Grading: socratic_dialogue (medium)...")
+    results["socratic_dialogue"] = grade_socratic_dialogue()
+    print(f"        Score: {results['socratic_dialogue']['score']} | Passed: {results['socratic_dialogue']['passed']}")
+    print("  [3/3] Grading: misconception_trap (hard)...")
+    results["misconception_trap"] = grade_misconception_trap()
+    print(f"        Score: {results['misconception_trap']['score']} | Passed: {results['misconception_trap']['passed']}")
+    all_scores = [r["score"] for r in results.values()]
+    overall = round(sum(all_scores) / len(all_scores), 3)
+    print(f"\n── Overall Score: {overall} ─────────────────────────")
+    print(f"── All Passed:   {all(r['passed'] for r in results.values())} ──\n")
+    return {
+        "tasks": results,
+        "overall_score": overall,
+        "all_passed": all(r["passed"] for r in results.values()),
+    }
+if __name__ == "__main__":
+    run_all_graders()
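
All three graders repeat the same loop: step through the responses, sum the per-turn scores, stop when the episode is done, and report the mean capped at 1.0. A minimal sketch of that loop as one helper follows. The names `average_episode_score` and `make_fake_step` are hypothetical and do not appear in graders.py; the fake step function stands in for `POST /step` so the sketch runs without a server.

```python
from typing import Callable, Iterable, Tuple


def average_episode_score(step: Callable[[str], dict],
                          responses: Iterable[str]) -> Tuple[float, int]:
    """Run one episode and return (mean per-turn score capped at 1.0, turns)."""
    total = 0.0
    turns = 0
    for resp in responses:
        result = step(resp)
        total += result["reward"]["score"]
        turns += 1
        if result["done"]:
            break
    # max(turns, 1) avoids division by zero when no responses were given.
    return round(min(total / max(turns, 1), 1.0), 3), turns


def make_fake_step(scores):
    """Hypothetical stand-in for POST /step that replays fixed scores."""
    queue = list(scores)

    def fake_step(_response: str) -> dict:
        score = queue.pop(0)
        return {"reward": {"score": score}, "done": not queue}

    return fake_step
```

With a helper like this, each `grade_*` function would reduce to choosing its default responses, calling `_reset`, and passing `_step` as the `step` argument.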

inference.py ADDED Viewed

	@@ -0,0 +1,162 @@

+"""
+Inference Script — SocraticEnv
+================================
+MANDATORY variables (set in environment before running):
+  API_BASE_URL  — The API endpoint for the LLM
+  MODEL_NAME    — The model identifier to use
+  HF_TOKEN      — Your HuggingFace token (used as API key)
+Run:
+  python inference.py
+"""
+import os
+import time
+import requests
+from openai import OpenAI
+from dotenv import load_dotenv
+load_dotenv()
+# ── Config ────────────────────────────────────────────────
+API_BASE_URL = os.getenv("API_BASE_URL", "https://api-inference.huggingface.co/v1")
+MODEL_NAME   = os.getenv("MODEL_NAME",   "mistralai/Mistral-7B-Instruct-v0.3")
+HF_TOKEN     = os.getenv("HF_TOKEN",     "")
+ENV_URL      = os.getenv("ENV_URL",      "http://localhost:7860")
+MAX_TURNS    = 10
+TEMPERATURE  = 0.3
+client = OpenAI(
+    base_url=API_BASE_URL,
+    api_key=HF_TOKEN,
+)
+TASKS = ["factual_recall", "socratic_dialogue", "misconception_trap"]
+SYSTEM_PROMPT = """You are an intelligent student in a Socratic dialogue with a tutor.
+Your goals:
+1. Answer questions clearly and accurately using correct terminology.
+2. Show your reasoning — explain WHY, not just WHAT.
+3. Be alert: if the tutor states something FALSE or misleading,
+   you must confidently disagree and explain the correct answer.
+4. Stay engaged and thoughtful throughout the conversation.
+Keep responses focused and between 3-6 sentences."""
+def call_llm(messages: list) -> str:
+    """Call the LLM and return its response text."""
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=messages,
+            max_tokens=300,
+            temperature=TEMPERATURE,
+        )
+        return completion.choices[0].message.content.strip()
+    except Exception as e:
+        print(f"  [LLM ERROR] {e}")
+        return "I need to think about that more carefully before responding."
+def reset_env(task_id: str) -> dict:
+    r = requests.post(f"{ENV_URL}/reset", json={"task_id": task_id})
+    r.raise_for_status()
+    return r.json()
+def step_env(response: str) -> dict:
+    r = requests.post(f"{ENV_URL}/step", json={"response": response})
+    r.raise_for_status()
+    return r.json()
+def run_task(task_id: str) -> dict:
+    """Run one full episode of a task and return results."""
+    print(f"\n── Task: {task_id} ─────────────────────────────────")
+    reset_data = reset_env(task_id)
+    obs = reset_data["observation"]
+    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+    total_score = 0.0
+    turns = 0
+    print(f"  Tutor: {obs['question'][:100]}...")
+    for _ in range(MAX_TURNS):
+        # Add tutor question to messages
+        messages.append({"role": "user", "content": obs["question"]})
+        # Get agent response from LLM
+        agent_response = call_llm(messages)
+        messages.append({"role": "assistant", "content": agent_response})
+        print(f"  Agent (turn {turns+1}): {agent_response[:80]}...")
+        # Step the environment
+        result = step_env(agent_response)
+        reward = result["reward"]["score"]
+        total_score += reward
+        turns += 1
+        print(f"  Reward: {reward:.3f} | Breakdown: {result['reward']['breakdown']}")
+        if result["done"]:
+            break
+        obs = result["observation"]
+        time.sleep(0.5)  # be gentle with the API
+    final_score = round(min(total_score / max(turns, 1), 1.0), 3)
+    print(f"  ── Final Score: {final_score} ({'PASS' if final_score >= 0.5 else 'FAIL'})")
+    return {
+        "task": task_id,
+        "score": final_score,
+        "turns": turns,
+        "passed": final_score >= 0.5,
+    }
+def main():
+    print("\n════════════════════════════════════════════")
+    print("  SocraticEnv — Baseline Inference Script")
+    print("════════════════════════════════════════════")
+    print(f"  Model:   {MODEL_NAME}")
+    print(f"  Env URL: {ENV_URL}")
+    print("════════════════════════════════════════════")
+    # Check env is up
+    try:
+        r = requests.get(f"{ENV_URL}/ping", timeout=5)  # fail fast if the env is unreachable
+        r.raise_for_status()
+        print("  Env: ONLINE ✓")
+    except Exception:
+        print("  ERROR: Environment is not running!")
+        print("  Start it first with: python main.py")
+        return
+    results = {}
+    for task_id in TASKS:
+        results[task_id] = run_task(task_id)
+        time.sleep(1)
+    # Summary
+    print("\n════════════════════════════════════════════")
+    print("  RESULTS SUMMARY")
+    print("════════════════════════════════════════════")
+    all_scores = []
+    for task_id, r in results.items():
+        status = "✓ PASS" if r["passed"] else "✗ FAIL"
+        print(f"  {status} | {task_id:<25} | Score: {r['score']:.3f}")
+        all_scores.append(r["score"])
+    overall = round(sum(all_scores) / len(all_scores), 3)
+    print(f"\n  Overall Score: {overall:.3f}")
+    print(f"  All Passed:   {all(r['passed'] for r in results.values())}")
+    print("════════════════════════════════════════════\n")
+if __name__ == "__main__":
+    main()

leaderboard.json ADDED Viewed

	@@ -0,0 +1,28 @@

+{
+  "entries": [
+    {
+      "model_name": "Llama 3.1 8B (baseline)",
+      "factual_recall": 0.71,
+      "socratic_dialogue": 0.68,
+      "misconception_trap": 0.58,
+      "overall": 0.657,
+      "timestamp": "2026-04-06 17:10 UTC"
+    },
+    {
+      "model_name": "Random agent",
+      "factual_recall": 0.18,
+      "socratic_dialogue": 0.22,
+      "misconception_trap": 0.1,
+      "overall": 0.167,
+      "timestamp": "2026-04-06 17:10 UTC"
+    },
+    {
+      "model_name": "Test Model pytest",
+      "factual_recall": 0.75,
+      "socratic_dialogue": 0.68,
+      "misconception_trap": 0.6,
+      "overall": 0.677,
+      "timestamp": "2026-04-07 13:24 UTC"
+    }
+  ]
+}
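
The `overall` values in this file appear to be the plain mean of the three task scores, rounded to three decimals, which matches how `run_all_graders` and inference.py compute their overall score. A small sketch to check an entry follows; the helper name `overall_score` is hypothetical.

```python
TASKS = ("factual_recall", "socratic_dialogue", "misconception_trap")


def overall_score(entry: dict) -> float:
    """Mean of the three task scores, rounded to 3 decimals."""
    return round(sum(entry[t] for t in TASKS) / len(TASKS), 3)


baseline = {
    "factual_recall": 0.71,
    "socratic_dialogue": 0.68,
    "misconception_trap": 0.58,
}
print(overall_score(baseline))  # compare with the stored "overall" field
```

Note that `POST /leaderboard` in main.py stores whatever `overall` the client sends, so a check like this would be the caller's responsibility.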

main.py ADDED Viewed

	@@ -0,0 +1,684 @@

+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+from typing import Optional
+from fastapi.staticfiles import StaticFiles
+from openai import OpenAI
+import os
+from dotenv import load_dotenv
+import json
+from pathlib import Path
+from datetime import datetime, timezone
+load_dotenv()
+import uvicorn
+from environment import (
+    SocraticEnvironment,
+    Observation,
+    Action,
+    StepResult,
+    StateInfo,
+)
+# ── App Setup ─────────────────────────────────────────────
+app = FastAPI(
+    title="SocraticEnv",
+    description="A Socratic teaching environment for the OpenEnv hackathon.",
+    version="1.0.0",
+)
+app.mount("/ui", StaticFiles(directory="static", html=True), name="static")
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
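+# One global environment instance, shared by every client of this server.
+# Concurrent callers therefore share (and can overwrite) the same episode state.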
+env = SocraticEnvironment()
+# ── Request / Response Models ─────────────────────────────
+class ResetRequest(BaseModel):
+    task_id: str = "factual_recall"
+class StepRequest(BaseModel):
+    response: str
+class TaskInfo(BaseModel):
+    id: str
+    name: str
+    difficulty: str
+    description: str
+# ── Routes ────────────────────────────────────────────────
+@app.get("/")
+def root():
+    return {
+        "name": "SocraticEnv",
+        "version": "1.0.0",
+        "status": "running",
+        "description": "Socratic AI tutor environment — OpenEnv hackathon submission",
+        "endpoints": {
+            "reset": "POST /reset",
+            "step":  "POST /step",
+            "state": "GET  /state",
+            "tasks": "GET  /tasks",
+            "ping":  "GET  /ping",
+        },
+    }
+@app.get("/ping")
+def ping():
+    """Health check — used by HuggingFace and the validator."""
+    return {"status": "ok", "env": "SocraticEnv"}
+@app.get("/tasks")
+def list_tasks():
+    """Return all available tasks."""
+    return {
+        "tasks": [
+            TaskInfo(
+                id="factual_recall",
+                name="Factual Recall",
+                difficulty="easy",
+                description=(
+                    "Agent must explain a concept clearly and accurately. "
+                    "Graded on key term coverage, substance, and ability "
+                    "to reject a common misconception."
+                ),
+            ),
+            TaskInfo(
+                id="socratic_dialogue",
+                name="Socratic Dialogue",
+                difficulty="medium",
+                description=(
+                    "Agent must engage in a 5-turn Socratic dialogue on a "
+                    "philosophical or social topic. Graded on depth of "
+                    "reasoning, use of evidence, and coherence."
+                ),
+            ),
+            TaskInfo(
+                id="misconception_trap",
+                name="Misconception Trap",
+                difficulty="hard",
+                description=(
+                    "The tutor plants a false belief mid-dialogue. The agent "
+                    "must detect it, correct it clearly, and explain why it "
+                    "is wrong. Penalised for accepting the false claim."
+                ),
+            ),
+            TaskInfo(
+                id="debate_mode",
+                name="Debate Mode",
+                difficulty="medium",
+                description=(
+                    "Agent must argue both sides of a controversial topic. "
+                    "Graded on argument quality, use of evidence, "
+                    "and clarity of position."
+                ),
+            ),
+            TaskInfo(
+                id="analogy_challenge",
+                name="Analogy Challenge",
+                difficulty="hard",
+                description=(
+                    "Agent must explain complex concepts using ONLY everyday "
+                    "analogies — no technical jargon allowed. "
+                    "Penalised for using forbidden technical terms."
+                ),
+            ),
+        ]
+    }
+@app.post("/reset")
+def reset(req: ResetRequest):
+    """
+    Start a new episode for the given task.
+    Returns the first observation (tutor's opening question).
+    """
+    valid_tasks = ["factual_recall", "socratic_dialogue", "misconception_trap", "debate_mode", "analogy_challenge"]
+    if req.task_id not in valid_tasks:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Invalid task_id '{req.task_id}'. Choose from: {valid_tasks}",
+        )
+    try:
+        obs = env.reset(req.task_id)
+        return {
+            "observation": obs.model_dump(),
+            "message": f"Episode started for task: {req.task_id}",
+        }
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+@app.post("/step")
+def step(req: StepRequest):
+    """
+    Submit the agent's response and get the next observation + reward.
+    """
+    if not req.response or not req.response.strip():
+        raise HTTPException(
+            status_code=400,
+            detail="Response cannot be empty.",
+        )
+    if env.done:
+        raise HTTPException(
+            status_code=400,
+            detail="Episode is finished. Call POST /reset to start a new one.",
+        )
+    try:
+        action = Action(response=req.response)
+        result = env.step(action)
+        return result.model_dump()
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+@app.get("/state")
+def state():
+    """Return the current state of the environment."""
+    return env.state().model_dump()
+class InferenceRequest(BaseModel):
+    message: str
+    history: list = []
+@app.post("/inference")
+async def run_inference(req: InferenceRequest):
+    """
+    Call the LLM to generate a student response.
+    Used by the UI for live Auto-Run demos.
+    """
+    api_base = os.getenv("API_BASE_URL", "").strip()
+    hf_token = os.getenv("HF_TOKEN", "").strip()
+    model    = os.getenv("MODEL_NAME", "").strip()
+    # Debug: confirm env vars are loaded
+    if not hf_token:
+        return {"response": "ERROR: HF_TOKEN not set in environment secrets.", "model": "none"}
+    if not api_base:
+        return {"response": "ERROR: API_BASE_URL not set in environment secrets.", "model": "none"}
+    if not model:
+        return {"response": "ERROR: MODEL_NAME not set in environment secrets.", "model": "none"}
+    try:
+        client = OpenAI(base_url=api_base, api_key=hf_token)
+        messages = [
+            {
+                "role": "system",
+                "content": (
+                    "You are an intelligent student in a Socratic dialogue with a tutor. "
+                    "Answer questions clearly and accurately using correct terminology. "
+                    "Show your reasoning. IMPORTANT: If the tutor states something FALSE "
+                    "or misleading, you must confidently disagree and explain the correct answer. "
+                    "Keep responses focused and between 3-6 sentences."
+                )
+            }
+        ]
+        for h in req.history:
+            messages.append({
+                "role": "user" if h["role"] == "tutor" else "assistant",
+                "content": h["content"]
+            })
+        messages.append({"role": "user", "content": req.message})
+        completion = client.chat.completions.create(
+            model=model,
+            messages=messages,
+            max_tokens=300,
+            temperature=0.3,
+        )
+        response = completion.choices[0].message.content.strip()
+        return {"response": response, "model": model}
+    except Exception as e:
+        return {"response": f"ERROR: {str(e)}", "model": "failed"}
+# ── OpenEnv Validator Required Endpoints ─────────────────
+@app.get("/health")
+def health():
+    """Required by openenv validate."""
+    return {
+        "status": "healthy",
+        "version": "1.0.0",
+        "environment": "SocraticEnv",
+    }
+@app.get("/metadata")
+def metadata():
+    """Required by openenv validate."""
+    return {
+        "name": "SocraticEnv",
+        "description": (
+            "A Socratic teaching environment where an AI agent plays the role "
+            "of a student. The environment acts as a tutor that asks probing "
+            "questions, plants misconceptions, and evaluates reasoning quality."
+        ),
+        "version": "1.0.0",
+        "author": "Amar Prakash",
+        "tags": ["openenv", "education", "reasoning", "socratic"],
+    }
+@app.get("/schema")
+def schema():
+    """Required by openenv validate."""
+    return {
+        "action": {
+            "type": "object",
+            "properties": {
+                "response": {
+                    "type": "string",
+                    "description": "The agent's reply to the tutor's question",
+                }
+            },
+            "required": ["response"],
+        },
+        "observation": {
+            "type": "object",
+            "properties": {
+                "question": {
+                    "type": "string",
+                    "description": "The tutor's current question or statement",
+                },
+                "turn":    {"type": "integer", "description": "Current turn number"},
+                "task_id": {"type": "string",  "description": "Which task is running"},
+                "context": {"type": "string",  "description": "Topic context"},
+                "hint":    {"type": "string",  "description": "Optional hint"},
+            },
+            "required": ["question", "turn", "task_id"],
+        },
+        "state": {
+            "type": "object",
+            "properties": {
+                "task_id":     {"type": "string"},
+                "turn":        {"type": "integer"},
+                "max_turns":   {"type": "integer"},
+                "total_score": {"type": "number"},
+                "history":     {"type": "array"},
+                "done":        {"type": "boolean"},
+            },
+        },
+    }
+@app.post("/mcp")
+def mcp(request: dict):
+    """
+    MCP (Model Context Protocol) endpoint.
+    Required by openenv validate.
+    Returns JSON-RPC 2.0 compliant response.
+    """
+    method  = request.get("method", "")
+    req_id  = request.get("id", 1)
+    jsonrpc = "2.0"
+    if method == "initialize":
+        return {
+            "jsonrpc": jsonrpc, "id": req_id,
+            "result": {
+                "name":        "SocraticEnv",
+                "version":     "1.0.0",
+                "description": "Socratic AI tutor OpenEnv environment",
+                "capabilities": {
+                    "tasks":       True,
+                    "reset":       True,
+                    "step":        True,
+                    "state":       True,
+                    "schema":      True,
+                    "health":      True,
+                },
+            },
+        }
+    if method == "tasks/list":
+        return {
+            "jsonrpc": jsonrpc, "id": req_id,
+            "result": {
+                "tasks": [
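+                    {"id": "factual_recall",     "difficulty": "easy"},
+                    {"id": "socratic_dialogue",  "difficulty": "medium"},
+                    {"id": "misconception_trap", "difficulty": "hard"},
+                    {"id": "debate_mode",        "difficulty": "medium"},
+                    {"id": "analogy_challenge",  "difficulty": "hard"},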
+                ]
+            },
+        }
+    # Default response for any other method
+    return {
+        "jsonrpc": jsonrpc, "id": req_id,
+        "result":  {"status": "ok", "method": method},
+    }
+from fastapi.responses import RedirectResponse
+@app.get("/leaderboard-ui")
+def leaderboard_ui():
+    """Redirect to the leaderboard UI page."""
+    return RedirectResponse(url="/ui/leaderboard.html")
+# ── Leaderboard ───────────────────────────────────────────
+LEADERBOARD_FILE = Path("leaderboard.json")
+def load_leaderboard() -> dict:
+    try:
+        if LEADERBOARD_FILE.exists():
+            with open(LEADERBOARD_FILE, "r") as f:
+                return json.load(f)
+    except Exception:
+        pass
+    return {"entries": []}
+def save_leaderboard(data: dict):
+    with open(LEADERBOARD_FILE, "w") as f:
+        json.dump(data, f, indent=2)
+class LeaderboardEntry(BaseModel):
+    model_name: str
+    factual_recall: float
+    socratic_dialogue: float
+    misconception_trap: float
+    overall: float
+    timestamp: str = ""
+@app.get("/leaderboard")
+def get_leaderboard():
+    """Return all leaderboard entries sorted by overall score."""
+    data = load_leaderboard()
+    entries = sorted(
+        data["entries"],
+        key=lambda x: x["overall"],
+        reverse=True
+    )
+    return {"entries": entries, "total": len(entries)}
+@app.post("/leaderboard")
+def add_leaderboard_entry(entry: LeaderboardEntry):
+    """Add or update a model's score on the leaderboard."""
+    data = load_leaderboard()
+    entry.timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
+    # Update if model already exists, otherwise add
+    existing = [e for e in data["entries"] if e["model_name"] == entry.model_name]
+    if existing:
+        for e in data["entries"]:
+            if e["model_name"] == entry.model_name:
+                e.update(entry.model_dump())
+    else:
+        data["entries"].append(entry.model_dump())
+    save_leaderboard(data)
+    return {"success": True, "entry": entry.model_dump()}
+@app.delete("/leaderboard/{model_name}")
+def delete_leaderboard_entry(model_name: str):
+    """Remove a model from the leaderboard."""
+    data = load_leaderboard()
+    data["entries"] = [
+        e for e in data["entries"]
+        if e["model_name"] != model_name
+    ]
+    save_leaderboard(data)
+    return {"success": True}
+@app.post("/leaderboard/run")
+async def run_leaderboard_evaluation(request: dict):
+    """
+    Evaluate the model configured via MODEL_NAME on the three leaderboard
+    tasks and save the result under the requested model_name label.
+    """
+    model_name = request.get("model_name", "Unknown Model")
+    scores = {}
+    task_ids = ["factual_recall", "socratic_dialogue", "misconception_trap"]
+    api_base = os.getenv("API_BASE_URL", "").strip()
+    hf_token = os.getenv("HF_TOKEN", "").strip()
+    model    = os.getenv("MODEL_NAME", "").strip()
+    if not hf_token or not api_base or not model:
+        return {"error": "API credentials not configured in environment secrets."}
+    try:
+        client = OpenAI(base_url=api_base, api_key=hf_token)
+        system_prompt = (
+            "You are an intelligent student in a Socratic dialogue. "
+            "Answer accurately using correct terminology. Show reasoning. "
+            "If the tutor states something FALSE, confidently disagree and correct it. "
+            "Keep responses to 3-5 sentences."
+        )
+        for task_id in task_ids:
+            # Reset environment
+            obs = env.reset(task_id)
+            total = 0.0
+            turns = 0
+            messages = [{"role": "system", "content": system_prompt}]
+            for _ in range(10):
+                messages.append({"role": "user", "content": obs.question})
+                try:
+                    completion = client.chat.completions.create(
+                        model=model,
+                        messages=messages,
+                        max_tokens=250,
+                        temperature=0.3,
+                    )
+                    response = completion.choices[0].message.content.strip()
+                except Exception:
+                    # Model call failed: use a neutral fallback so the episode can continue.
+                    response = "I need to think carefully about this."
+                messages.append({"role": "assistant", "content": response})
+                action = Action(response=response)
+                result = env.step(action)
+                total += result.reward.score
+                turns += 1
+                if result.done:
+                    break
+                obs = result.observation
+            scores[task_id] = round(min(total / max(turns, 1), 1.0), 3)
+        overall = round(sum(scores.values()) / len(scores), 3)
+        # Save to leaderboard
+        entry = LeaderboardEntry(
+            model_name=model_name,
+            factual_recall=scores["factual_recall"],
+            socratic_dialogue=scores["socratic_dialogue"],
+            misconception_trap=scores["misconception_trap"],
+            overall=overall,
+        )
+        entry.timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
+        data = load_leaderboard()
+        existing = [e for e in data["entries"] if e["model_name"] == model_name]
+        if existing:
+            for e in data["entries"]:
+                if e["model_name"] == entry.model_name:
+                    e.update(entry.model_dump())
+        else:
+            data["entries"].append(entry.model_dump())
+        save_leaderboard(data)
+        return {
+            "success": True,
+            "model_name": model_name,
+            "scores": scores,
+            "overall": overall,
+        }
+    except Exception as e:
+        return {"error": str(e)}
+# ── Adaptive Task Generator ───────────────────────────────
+class GenerateTaskRequest(BaseModel):
+    topic: str
+    difficulty: str = "medium"  # easy, medium, hard
+@app.post("/generate_task")
+async def generate_task(req: GenerateTaskRequest):
+    """
+    Use an LLM to generate a new Socratic task on any topic.
+    The task is inserted at the front of the matching in-memory question bank.
+    """
+    api_base = os.getenv("API_BASE_URL", "").strip()
+    hf_token = os.getenv("HF_TOKEN", "").strip()
+    model    = os.getenv("MODEL_NAME", "").strip()
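+    if not hf_token or not api_base or not model:
+        return {"error": "API credentials not configured."}
+    # Reject unknown difficulties early; the prompt lookup below would raise KeyError.
+    if req.difficulty not in ("easy", "medium", "hard"):
+        return {"error": f"Unknown difficulty '{req.difficulty}'. Use easy, medium, or hard."}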
+    difficulty_instructions = {
+        "easy": (
+            "Generate a simple factual question about the topic. "
+            "Then generate one follow-up question that goes slightly deeper. "
+            "Finally generate a common misconception about this topic as a statement."
+        ),
+        "medium": (
+            "Generate an open-ended philosophical or analytical question about the topic "
+            "that requires reasoning, not just facts. "
+            "Then generate 4 probing follow-up questions that challenge the student's thinking."
+        ),
+        "hard": (
+            "Generate an overview question about the topic. "
+            "Then generate a confident but FALSE statement about the topic "
+            "that sounds plausible but is actually wrong. "
+            "This will be used to test if an AI can detect the misconception."
+        ),
+    }
+    prompt = f"""You are designing a Socratic tutoring session about: "{req.topic}"
+{difficulty_instructions[req.difficulty]}
+Respond ONLY with valid JSON in exactly this format, no other text:
+For easy difficulty:
+{{
+  "concept": "{req.topic}",
+  "opening": "your opening question here",
+  "follow_up": "your follow-up question here",
+  "common_misconception": "your misconception statement here",
+  "key_terms": ["term1", "term2", "term3", "term4"]
+}}
+For medium difficulty:
+{{
+  "topic": "{req.topic}",
+  "turns": [
+    "question 1",
+    "question 2",
+    "question 3",
+    "question 4",
+    "question 5"
+  ]
+}}
+For hard difficulty:
+{{
+  "subject": "{req.topic}",
+  "setup": "your overview question here",
+  "trap_statement": "your false statement here",
+  "correct_response_keywords": ["keyword1", "keyword2", "keyword3"],
+  "explanation": "explanation of why the statement is false",
+  "follow_up_after_correction": "your follow-up question after correction"
+}}
+Generate for {req.difficulty} difficulty now:"""
+    try:
+        client = OpenAI(base_url=api_base, api_key=hf_token)
+        completion = client.chat.completions.create(
+            model=model,
+            messages=[
+                {
+                    "role": "system",
+                    "content": "You are a JSON generator. Output only valid JSON, no markdown, no explanation."
+                },
+                {"role": "user", "content": prompt}
+            ],
+            max_tokens=600,
+            temperature=0.7,
+        )
+        raw = completion.choices[0].message.content.strip()
+        # Clean up markdown code blocks if model adds them
+        raw = raw.replace("```json", "").replace("```", "").strip()
+        task_data = json.loads(raw)
+        task_data["_generated"] = True
+        task_data["_topic"] = req.topic
+        task_data["_difficulty"] = req.difficulty
+        # Inject into environment's question banks
+        if req.difficulty == "easy":
+            from environment import FACTUAL_TOPICS
+            # Ensure required fields exist
+            if "key_terms" not in task_data:
+                task_data["key_terms"] = [req.topic]
+            FACTUAL_TOPICS.insert(0, task_data)
+            return {
+                "success": True,
+                "task_id": "factual_recall",
+                "difficulty": "easy",
+                "topic": req.topic,
+                "preview": task_data.get("opening", ""),
+                "message": f"Generated new easy task about '{req.topic}'. Start a factual_recall episode to use it.",
+            }
+        elif req.difficulty == "medium":
+            from environment import SOCRATIC_DIALOGUES
+            SOCRATIC_DIALOGUES.insert(0, task_data)
+            return {
+                "success": True,
+                "task_id": "socratic_dialogue",
+                "difficulty": "medium",
+                "topic": req.topic,
+                "preview": task_data.get("turns", [""])[0],
+                "message": f"Generated new medium task about '{req.topic}'. Start a socratic_dialogue episode to use it.",
+            }
+        elif req.difficulty == "hard":
+            from environment import MISCONCEPTION_TRAPS
+            if "correct_response_keywords" not in task_data:
+                task_data["correct_response_keywords"] = ["wrong", "incorrect", "false"]
+            MISCONCEPTION_TRAPS.insert(0, task_data)
+            return {
+                "success": True,
+                "task_id": "misconception_trap",
+                "difficulty": "hard",
+                "topic": req.topic,
+                "preview": task_data.get("setup", ""),
+                "message": f"Generated new hard task about '{req.topic}'. Start a misconception_trap episode to use it.",
+            }
+    except json.JSONDecodeError as e:
+        return {"error": f"LLM returned invalid JSON: {str(e)}", "raw": raw}
+    except Exception as e:
+        return {"error": str(e)}
+# ── Entry Point ───────────────────────────────────────────
+if __name__ == "__main__":
+    uvicorn.run("main:app", host="0.0.0.0", port=7860, reload=False)
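
Review note on the leaderboard code in main.py: `add_leaderboard_entry` and `run_leaderboard_evaluation` repeat the same update-or-append logic, and `save_leaderboard` rewrites `leaderboard.json` in place, so an interrupted write could leave a truncated file. The sketch below shows one possible shared helper and an atomic write. `upsert_entry` and `save_atomic` are hypothetical names that do not exist in this commit, and the behavior is assumed to match the original (entries keyed by `model_name`, updated in place).

```python
import json
import os
import tempfile
from pathlib import Path


def upsert_entry(entries: list, new: dict) -> list:
    """Update the entry with the same model_name in place, or append a new one."""
    for i, existing in enumerate(entries):
        if existing.get("model_name") == new["model_name"]:
            entries[i] = {**existing, **new}
            return entries
    entries.append(new)
    return entries


def save_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temp file next to `path`, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if writing or swapping failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Both endpoints could then call `upsert_entry(data["entries"], entry.model_dump())` followed by a single save. Concurrent requests would still need a lock around the load, update, and save sequence.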

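Review note on `generate_task` in main.py: the endpoint strips Markdown code-fence markers from the model reply and inserts the parsed object into a question bank without checking its fields. A minimal sketch of a more defensive parse follows. The key lists are copied from the JSON formats in the prompt above; `extract_json_object`, `missing_keys`, and `REQUIRED_KEYS` are hypothetical names, not part of this commit.

```python
import json

# Required fields per difficulty, taken from the prompt's JSON templates.
REQUIRED_KEYS = {
    "easy": ("concept", "opening", "follow_up", "common_misconception", "key_terms"),
    "medium": ("topic", "turns"),
    "hard": (
        "subject", "setup", "trap_statement", "correct_response_keywords",
        "explanation", "follow_up_after_correction",
    ),
}


def extract_json_object(raw: str) -> dict:
    """Parse the outermost {...} span of a model reply, ignoring text around it."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError subclasses ValueError, so callers can catch one type.
    return json.loads(raw[start:end + 1])


def missing_keys(task: dict, difficulty: str) -> list:
    """Return the required fields that the generated task does not contain."""
    return [key for key in REQUIRED_KEYS[difficulty] if key not in task]
```

Rejecting a task when `missing_keys` is non-empty would keep malformed entries out of the question banks instead of failing later inside an episode.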
openenv.yaml ADDED

	@@ -0,0 +1,47 @@

+name: SocraticEnv
+version: "1.0.0"
+description: >
+  A Socratic teaching environment where an AI agent plays the role
+  of a student. The environment acts as a tutor that asks probing
+  questions, plants misconceptions, and evaluates reasoning quality.
+  Tests factual recall, multi-turn coherence, and critical thinking.
+author: Amar Prakash
+tags:
+  - openenv
+  - education
+  - reasoning
+  - socratic
+  - llm-evaluation
+observation_space:
+  type: text
+  description: A question or statement from the Socratic tutor
+action_space:
+  type: text
+  description: The agent's response to the tutor's question
+reward_range: [0.0, 1.0]
+tasks:
+  - id: factual_recall
+    name: Factual Recall
+    difficulty: easy
+    description: Agent must explain a concept clearly and accurately
+  - id: socratic_dialogue
+    name: Socratic Dialogue
+    difficulty: medium
+    description: Agent must stay coherent across a 5-turn Socratic dialogue
+  - id: misconception_trap
+    name: Misconception Trap
+    difficulty: hard
+    description: Agent must detect and correct a false belief planted by the tutor
+  - id: debate_mode
+    name: Debate Mode
+    difficulty: medium
+    description: Agent must argue both sides of a controversial topic
+  - id: analogy_challenge
+    name: Analogy Challenge
+    difficulty: hard
+    description: Agent must explain concepts using only everyday analogies
+endpoints:
+  reset: POST /reset
+  step: POST /step
+  state: GET /state
+  tasks: GET /tasks

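Review note on the `endpoints` block in openenv.yaml: a client drives an episode by calling `POST /reset` once and then `POST /step` until the reply reports `done`. The sketch below assumes the payload shapes used by the dashboard script in static/index.html (`{"task_id": ...}` for reset, `{"response": ...}` for step, and replies carrying `observation.question`, `reward.score`, and `done`). `run_episode`, `post`, and `answer` are hypothetical names; the HTTP transport is injected so the loop can be exercised without a running server.

```python
from typing import Callable

# (path, json_body) -> decoded JSON reply
PostFn = Callable[[str, dict], dict]


def run_episode(
    post: PostFn,
    task_id: str,
    answer: Callable[[str], str],
    max_turns: int = 10,
) -> float:
    """Play one episode and return the mean per-turn reward."""
    question = post("/reset", {"task_id": task_id})["observation"]["question"]
    total, turns = 0.0, 0
    for _ in range(max_turns):
        result = post("/step", {"response": answer(question)})
        total += result["reward"]["score"]
        turns += 1
        if result["done"]:
            break
        question = result["observation"]["question"]
    # Guard against division by zero when max_turns is 0.
    return round(total / max(turns, 1), 3)
```

With the `requests` package from requirements.txt, `post` could be `lambda path, body: requests.post(base_url + path, json=body, timeout=30).json()`, where `base_url` is whatever address the Space is served from.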
requirements.txt ADDED

	@@ -0,0 +1,8 @@

+fastapi==0.109.0
+uvicorn==0.27.0
+pydantic==2.5.3
+openai==1.12.0
+python-dotenv==1.0.0
+requests==2.31.0
+pytest==7.4.4
+httpx==0.26.0

static/index.html ADDED

	@@ -0,0 +1,850 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8" />
+  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
+  <title>SocraticEnv — Live Dashboard</title>
+  <style>
+    * { margin: 0; padding: 0; box-sizing: border-box; }
+    body {
+      font-family: 'Segoe UI', system-ui, sans-serif;
+      background: #0d1117; color: #e6edf3; min-height: 100vh;
+    }
+    .header {
+      background: #161b22; border-bottom: 1px solid #30363d;
+      padding: 16px 32px; display: flex; align-items: center;
+      justify-content: space-between;
+    }
+    .header-left { display: flex; align-items: center; gap: 12px; }
+    .logo {
+      width: 36px; height: 36px;
+      background: linear-gradient(135deg, #7c3aed, #a855f7);
+      border-radius: 8px; display: flex; align-items: center;
+      justify-content: center; font-size: 18px;
+    }
+    .header h1 { font-size: 18px; font-weight: 600; color: #e6edf3; }
+    .header p  { font-size: 12px; color: #8b949e; margin-top: 2px; }
+    .header-right { display: flex; align-items: center; gap: 10px; }
+    .nav-link {
+      padding: 6px 14px; border-radius: 8px; font-size: 12px;
+      font-weight: 600; text-decoration: none; border: 1px solid #30363d;
+      color: #8b949e; background: #21262d; transition: all 0.2s;
+    }
+    .nav-link:hover { color: #e6edf3; border-color: #7c3aed; }
+    .nav-link.active { color: #a855f7; border-color: #7c3aed; background: #13111e; }
+    .status-badge {
+      display: flex; align-items: center; gap: 6px;
+      background: #1a2332; border: 1px solid #30363d;
+      border-radius: 20px; padding: 6px 14px;
+      font-size: 12px; color: #8b949e;
+    }
+    .status-dot {
+      width: 8px; height: 8px; border-radius: 50%;
+      background: #3fb950; box-shadow: 0 0 6px #3fb950;
+      animation: pulse 2s infinite;
+    }
+    .status-dot.offline { background: #f85149; box-shadow: 0 0 6px #f85149; animation: none; }
+    @keyframes pulse { 0%,100%{opacity:1} 50%{opacity:0.5} }
+    .container {
+      display: grid; grid-template-columns: 300px 1fr;
+      height: calc(100vh - 69px);
+    }
+    .sidebar {
+      background: #161b22; border-right: 1px solid #30363d;
+      padding: 20px; overflow-y: auto;
+    }
+    .sidebar-section { margin-bottom: 24px; }
+    .sidebar-title {
+      font-size: 11px; font-weight: 600; color: #8b949e;
+      letter-spacing: 1px; text-transform: uppercase; margin-bottom: 12px;
+    }
+    .task-card {
+      background: #0d1117; border: 1px solid #30363d;
+      border-radius: 10px; padding: 14px; margin-bottom: 8px;
+      cursor: pointer; transition: all 0.2s;
+    }
+    .task-card:hover { border-color: #7c3aed; background: #13111e; }
+    .task-card.active {
+      border-color: #7c3aed; background: #13111e;
+      box-shadow: 0 0 0 1px #7c3aed22;
+    }
+    .task-header {
+      display: flex; align-items: center;
+      justify-content: space-between; margin-bottom: 6px;
+    }
+    .task-name { font-size: 13px; font-weight: 600; color: #e6edf3; }
+    .difficulty {
+      font-size: 10px; font-weight: 600; padding: 2px 8px;
+      border-radius: 10px; text-transform: uppercase; letter-spacing: 0.5px;
+    }
+    .easy   { background: #1a3a2a; color: #3fb950; border: 1px solid #3fb95040; }
+    .medium { background: #332d1a; color: #d29922; border: 1px solid #d2992240; }
+    .hard   { background: #3a1a1a; color: #f85149; border: 1px solid #f8514940; }
+    .task-desc { font-size: 11px; color: #8b949e; line-height: 1.5; }
+    .score-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 8px; }
+    .score-card {
+      background: #0d1117; border: 1px solid #30363d;
+      border-radius: 8px; padding: 12px; text-align: center;
+    }
+    .score-value { font-size: 22px; font-weight: 700; color: #7c3aed; }
+    .score-label { font-size: 10px; color: #8b949e; margin-top: 2px; }
+    .score-card.full { grid-column: 1 / 3; }
+    .score-card.full .score-value { font-size: 28px; color: #a855f7; }
+    .turn-track { display: flex; gap: 4px; margin-top: 4px; }
+    .turn-dot {
+      flex: 1; height: 4px; border-radius: 2px;
+      background: #30363d; transition: background 0.3s;
+    }
+    .turn-dot.done { background: #7c3aed; }
+    .turn-dot.current { background: #a855f7; animation: pulse 1s infinite; }
+    .main { display: flex; flex-direction: column; overflow: hidden; }
+    .controls {
+      background: #161b22; border-bottom: 1px solid #30363d;
+      padding: 14px 24px; display: flex; align-items: center; gap: 12px;
+    }
+    .btn {
+      padding: 8px 18px; border-radius: 8px; font-size: 13px;
+      font-weight: 600; border: none; cursor: pointer;
+      transition: all 0.2s; display: flex; align-items: center; gap: 6px;
+    }
+    .btn-primary { background: #7c3aed; color: white; }
+    .btn-primary:hover { background: #6d28d9; }
+    .btn-primary:disabled { background: #3d2070; color: #8b6bb5; cursor: not-allowed; }
+    .btn-secondary { background: #21262d; color: #e6edf3; border: 1px solid #30363d; }
+    .btn-secondary:hover { background: #30363d; }
+    .btn-danger { background: #3a1a1a; color: #f85149; border: 1px solid #f8514940; }
+    .btn-danger:hover { background: #f8514920; }
+    .controls-right { margin-left: auto; display: flex; align-items: center; gap: 10px; }
+    .speed-label { font-size: 12px; color: #8b949e; }
+    .speed-select {
+      background: #21262d; border: 1px solid #30363d;
+      color: #e6edf3; border-radius: 6px; padding: 5px 10px; font-size: 12px;
+    }
+    .dialogue-area {
+      flex: 1; overflow-y: auto; padding: 24px;
+      display: flex; flex-direction: column; gap: 16px;
+    }
+    .empty-state {
+      flex: 1; display: flex; flex-direction: column;
+      align-items: center; justify-content: center;
+      gap: 12px; color: #8b949e; margin: auto;
+    }
+    .empty-icon { font-size: 48px; opacity: 0.4; }
+    .empty-title { font-size: 16px; font-weight: 600; color: #8b949e; }
+    .empty-sub { font-size: 13px; }
+    .message {
+      display: flex; gap: 12px; max-width: 85%;
+      animation: fadeUp 0.3s ease;
+    }
+    @keyframes fadeUp {
+      from { opacity:0; transform: translateY(8px); }
+      to   { opacity:1; transform: translateY(0); }
+    }
+    .message.tutor { align-self: flex-start; }
+    .message.agent { align-self: flex-end; flex-direction: row-reverse; }
+    .avatar {
+      width: 36px; height: 36px; border-radius: 50%;
+      display: flex; align-items: center; justify-content: center;
+      font-size: 16px; flex-shrink: 0; margin-top: 2px;
+    }
+    .tutor .avatar { background: linear-gradient(135deg, #7c3aed, #a855f7); }
+    .agent .avatar { background: linear-gradient(135deg, #0d9488, #14b8a6); }
+    .bubble {
+      padding: 12px 16px; border-radius: 12px;
+      font-size: 14px; line-height: 1.6; max-width: 100%;
+    }
+    .tutor .bubble {
+      background: #161b22; border: 1px solid #30363d;
+      border-top-left-radius: 4px; color: #e6edf3;
+    }
+    .agent .bubble {
+      background: #13111e; border: 1px solid #7c3aed40;
+      border-top-right-radius: 4px; color: #e6edf3;
+    }
+    .bubble-meta {
+      font-size: 11px; color: #8b949e; margin-top: 6px;
+      display: flex; align-items: center; gap: 8px;
+    }
+    .agent .bubble-meta { justify-content: flex-end; }
+    .reward-pill {
+      display: inline-flex; align-items: center; gap: 4px;
+      padding: 2px 8px; border-radius: 10px;
+      font-size: 11px; font-weight: 600;
+    }
+    .reward-high { background: #1a3a2a; color: #3fb950; }
+    .reward-mid  { background: #332d1a; color: #d29922; }
+    .reward-low  { background: #3a1a1a; color: #f85149; }
+    .breakdown { display: flex; flex-wrap: wrap; gap: 4px; margin-top: 6px; }
+    .breakdown-item {
+      font-size: 10px; padding: 2px 7px; border-radius: 6px;
+      background: #21262d; border: 1px solid #30363d; color: #8b949e;
+    }
+    .typing { display: flex; gap: 12px; align-self: flex-start; }
+    .typing .avatar { background: linear-gradient(135deg, #0d9488, #14b8a6); }
+    .typing-dots {
+      background: #161b22; border: 1px solid #30363d;
+      border-radius: 12px; border-top-left-radius: 4px;
+      padding: 12px 16px; display: flex; gap: 4px; align-items: center;
+    }
+    .dot {
+      width: 6px; height: 6px; border-radius: 50%;
+      background: #8b949e; animation: bounce 1.2s infinite;
+    }
+    .dot:nth-child(2) { animation-delay: 0.2s; }
+    .dot:nth-child(3) { animation-delay: 0.4s; }
+    @keyframes bounce {
+      0%,60%,100%{transform:translateY(0)} 30%{transform:translateY(-6px)}
+    }
+    .input-area {
+      background: #161b22; border-top: 1px solid #30363d; padding: 16px 24px;
+    }
+    .input-row { display: flex; gap: 10px; }
+    .input-box {
+      flex: 1; background: #0d1117; border: 1px solid #30363d;
+      border-radius: 10px; padding: 10px 16px; color: #e6edf3;
+      font-size: 14px; font-family: inherit; resize: none;
+      transition: border 0.2s; min-height: 44px; max-height: 120px;
+    }
+    .input-box:focus { outline: none; border-color: #7c3aed; }
+    .input-box::placeholder { color: #484f58; }
+    .btn-send {
+      background: #7c3aed; border: none; border-radius: 10px;
+      color: white; padding: 10px 18px; cursor: pointer;
+      font-size: 18px; transition: background 0.2s; align-self: flex-end;
+    }
+    .btn-send:hover { background: #6d28d9; }
+    .btn-send:disabled { background: #3d2070; cursor: not-allowed; }
+    .input-hint {
+      font-size: 11px; color: #484f58; margin-top: 6px;
+      display: flex; justify-content: space-between;
+    }
+    .autorun-banner {
+      background: #13111e; border: 1px solid #7c3aed40;
+      border-radius: 8px; padding: 8px 14px; font-size: 12px;
+      color: #a855f7; display: none; align-items: center;
+      gap: 8px; margin-bottom: 10px;
+    }
+    .autorun-banner.visible { display: flex; }
+    .complete-banner {
+      background: #1a3a2a; border: 1px solid #3fb95040;
+      border-radius: 10px; padding: 16px 20px;
+      display: flex; align-items: center;
+      justify-content: space-between; animation: fadeUp 0.3s ease;
+    }
+    .complete-left { display: flex; align-items: center; gap: 12px; }
+    .complete-icon { font-size: 24px; }
+    .complete-title { font-size: 14px; font-weight: 600; color: #3fb950; }
+    .complete-sub { font-size: 12px; color: #8b949e; margin-top: 2px; }
+    .final-score { font-size: 28px; font-weight: 700; color: #3fb950; }
+    .system-msg {
+      text-align: center; font-size: 12px; color: #8b949e;
+      padding: 8px 16px; background: #161b22;
+      border: 1px solid #30363d; border-radius: 8px;
+      align-self: center;
+    }
+    .system-msg.error { color: #f85149; border-color: #f8514940; background: #3a1a1a; }
+    .system-msg.warning { color: #d29922; border-color: #d2992240; background: #332d1a; }
+    ::-webkit-scrollbar { width: 4px; }
+    ::-webkit-scrollbar-track { background: transparent; }
+    ::-webkit-scrollbar-thumb { background: #30363d; border-radius: 2px; }
+  </style>
+</head>
+<body>
+<div class="header">
+  <div class="header-left">
+    <div class="logo">🎓</div>
+    <div>
+      <h1>SocraticEnv</h1>
+      <p>OpenEnv Hackathon · Meta × PyTorch × Scaler</p>
+    </div>
+  </div>
+  <div class="header-right">
+    <a href="/ui/index.html" class="nav-link active">Live Demo</a>
+    <a href="/ui/leaderboard.html" class="nav-link">🏆 Leaderboard</a>
+    <a href="/docs" class="nav-link">API Docs</a>
+    <div class="status-badge">
+      <div class="status-dot" id="statusDot"></div>
+      <span id="statusText">Connecting...</span>
+    </div>
+  </div>
+</div>
+<div class="container">
+  <div class="sidebar">
+    <div class="sidebar-section">
+      <div class="sidebar-title">Choose a Task</div>
+      <div class="task-card active" onclick="selectTask('factual_recall')" id="card-factual_recall">
+        <div class="task-header">
+          <span class="task-name">Factual Recall</span>
+          <span class="difficulty easy">Easy</span>
+        </div>
+        <div class="task-desc">Agent explains a concept. Graded on accuracy, key terms, and rejecting misconceptions.</div>
+      </div>
+      <div class="task-card" onclick="selectTask('socratic_dialogue')" id="card-socratic_dialogue">
+        <div class="task-header">
+          <span class="task-name">Socratic Dialogue</span>
+          <span class="difficulty medium">Medium</span>
+        </div>
+        <div class="task-desc">5-turn philosophical dialogue. Graded on reasoning depth and coherence.</div>
+      </div>
+      <div class="task-card" onclick="selectTask('misconception_trap')" id="card-misconception_trap">
+        <div class="task-header">
+          <span class="task-name">Misconception Trap</span>
+          <span class="difficulty hard">Hard</span>
+        </div>
+        <div class="task-desc">Tutor plants a false belief. Agent must detect, correct, and explain.</div>
+      </div>
+      <div class="task-card" onclick="selectTask('debate_mode')" id="card-debate_mode">
+        <div class="task-header">
+          <span class="task-name">Debate Mode</span>
+          <span class="difficulty medium">Medium</span>
+        </div>
+        <div class="task-desc">Agent argues both sides of a topic. Graded on argument quality and use of evidence.</div>
+      </div>
+      <div class="task-card" onclick="selectTask('analogy_challenge')" id="card-analogy_challenge">
+        <div class="task-header">
+          <span class="task-name">Analogy Challenge</span>
+          <span class="difficulty hard">Hard</span>
+        </div>
+        <div class="task-desc">Explain complex concepts using ONLY analogies. No technical jargon allowed!</div>
+      </div>
+    </div>
+    <div class="sidebar-section">
+      <div class="sidebar-title">Generate Custom Task</div>
+      <div style="margin-bottom:8px;">
+        <input
+          id="topicInput"
+          placeholder="Any topic... e.g. Black holes"
+          style="width:100%;background:#0d1117;border:1px solid #30363d;border-radius:8px;padding:8px 10px;color:#e6edf3;font-size:12px;font-family:inherit;outline:none;"
+          onkeydown="if(event.key==='Enter') generateTask()"
+        />
+      </div>
+      <div style="display:flex;gap:6px;margin-bottom:8px;">
+        <select id="genDifficulty" style="flex:1;background:#21262d;border:1px solid #30363d;color:#e6edf3;border-radius:6px;padding:5px 8px;font-size:11px;">
+          <option value="easy">Easy</option>
+          <option value="medium" selected>Medium</option>
+          <option value="hard">Hard</option>
+        </select>
+        <button
+          onclick="generateTask()"
+          id="generateBtn"
+          style="flex:2;background:#7c3aed;color:white;border:none;border-radius:6px;padding:5px 10px;font-size:11px;font-weight:600;cursor:pointer;">
+          ✨ Generate
+        </button>
+      </div>
+      <div id="generateStatus" style="font-size:11px;color:#8b949e;min-height:16px;line-height:1.4;"></div>
+    </div>
+    <div class="sidebar-section">
+      <div class="sidebar-title">Live Scores</div>
+      <div class="score-grid">
+        <div class="score-card full">
+          <div class="score-value" id="overallScore">—</div>
+          <div class="score-label">Overall Score</div>
+        </div>
+        <div class="score-card">
+          <div class="score-value" id="turnCount" style="color:#d29922">0</div>
+          <div class="score-label">Turns</div>
+        </div>
+        <div class="score-card">
+          <div class="score-value" id="lastReward" style="color:#3fb950">—</div>
+          <div class="score-label">Last Reward</div>
+        </div>
+      </div>
+    </div>
+    <div class="sidebar-section">
+      <div class="sidebar-title">Turn Progress</div>
+      <div class="turn-track" id="turnTrack"></div>
+      <div style="font-size:11px;color:#8b949e;margin-top:8px" id="turnLabel">No active episode</div>
+    </div>
+    <div class="sidebar-section">
+      <div class="sidebar-title">Session History</div>
+      <div id="sessionHistory" style="font-size:12px;color:#8b949e;">
+        No completed episodes yet.
+      </div>
+    </div>
+  </div>
+  <div class="main">
+    <div class="controls">
+      <button class="btn btn-primary" id="btnStart" onclick="startEpisode()">▶ Start Episode</button>
+      <button class="btn btn-secondary" id="btnAutoRun" onclick="toggleAutoRun()">⚡ Auto-Run AI</button>
+      <button class="btn btn-danger" onclick="resetAll()">↺ Reset</button>
+      <div class="controls-right">
+        <span class="speed-label">Speed:</span>
+        <select class="speed-select" id="speedSelect">
+          <option value="2000">Slow</option>
+          <option value="1000" selected>Normal</option>
+          <option value="400">Fast</option>
+        </select>
+      </div>
+    </div>
+    <div class="dialogue-area" id="dialogueArea">
+      <div class="empty-state" id="emptyState">
+        <div class="empty-icon">🎓</div>
+        <div class="empty-title">SocraticEnv is ready</div>
+        <div class="empty-sub">Select a task and click Start Episode</div>
+      </div>
+    </div>
+    <div class="input-area">
+      <div class="autorun-banner" id="autorunBanner">
+        <span>⚡</span>
+        <span id="autorunStatus">Auto-Run mode — AI is thinking...</span>
+      </div>
+      <div class="input-row">
+        <textarea
+          class="input-box" id="inputBox"
+          placeholder="Type your response as the student agent... (or use Auto-Run AI)"
+          rows="1" disabled onkeydown="handleKey(event)"
+        ></textarea>
+        <button class="btn-send" id="btnSend" onclick="sendManual()" disabled>➤</button>
+      </div>
+      <div class="input-hint">
+        <span>Press Enter to send · Shift+Enter for new line</span>
+        <span id="taskHint">No active task</span>
+      </div>
+    </div>
+  </div>
+</div>
+<script>
+const API = window.location.origin;
+let selectedTask  = 'factual_recall';
+let episodeActive = false;
+let autoRunning   = false;
+let autoRunTimer  = null;
+let totalScore    = 0;
+let turnCount     = 0;
+let maxTurns      = 3;
+let sessionResults = [];
+let currentHistory = [];
+async function checkStatus() {
+  try {
+    const r = await fetch(`${API}/ping`);
+    const dot = document.getElementById('statusDot');
+    const txt = document.getElementById('statusText');
+    if (r.ok) {
+      dot.classList.remove('offline');
+      txt.textContent = 'Environment online';
+    } else {
+      dot.classList.add('offline');
+      txt.textContent = 'Environment offline';
+    }
+  } catch {
+    document.getElementById('statusDot').classList.add('offline');
+    document.getElementById('statusText').textContent = 'Cannot connect';
+  }
+}
+checkStatus();
+setInterval(checkStatus, 5000);
+function selectTask(taskId) {
+  selectedTask = taskId;
+  document.querySelectorAll('.task-card').forEach(c => c.classList.remove('active'));
+  document.getElementById(`card-${taskId}`).classList.add('active');
+  const hints = {
+    factual_recall:    'Easy — Explain a concept clearly',
+    socratic_dialogue: 'Medium — Engage in 5-turn reasoning',
+    misconception_trap:'Hard — Catch the planted false belief!',
+    debate_mode:       'Medium — Argue both sides convincingly',
+    analogy_challenge: 'Hard — No jargon, analogies only!',
+  };
+  document.getElementById('taskHint').textContent = hints[taskId];
+}
+async function startEpisode() {
+  clearDialogue();
+  episodeActive  = true;
+  totalScore     = 0;
+  turnCount      = 0;
+  currentHistory = [];
+  const maxMap = { factual_recall: 3, socratic_dialogue: 5, misconception_trap: 3, debate_mode: 4, analogy_challenge: 3 };
+  maxTurns = maxMap[selectedTask];
+  buildTurnTrack(maxTurns);
+  updateScores();
+  document.getElementById('btnStart').disabled = true;
+  document.getElementById('emptyState')?.remove();
+  try {
+    const r = await fetch(`${API}/reset`, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ task_id: selectedTask }),
+    });
+    const data = await r.json();
+    const question = data.observation.question;
+    currentHistory.push({ role: 'tutor', content: question });
+    addTutorMessage(question);
+    enableInput();
+    document.getElementById('turnLabel').textContent = `Turn 1 of ${maxTurns}`;
+  } catch (e) {
+    addSystemMessage('❌ Could not connect to environment.', 'error');
+    document.getElementById('btnStart').disabled = false;
+    episodeActive = false;
+  }
+}
+async function sendResponse(response) {
+  if (!episodeActive || !response || !response.trim()) return;
+  disableInput();
+  addAgentMessage(response);
+  currentHistory.push({ role: 'agent', content: response });
+  showTyping();
+  await sleep(300);
+  try {
+    const r = await fetch(`${API}/step`, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ response }),
+    });
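+    // Fail fast on HTTP errors (e.g. a 400 after the episode has ended)
+    // instead of tripping over a missing `reward` field below.
+    if (!r.ok) throw new Error(`HTTP ${r.status}`);
+    const data = await r.json();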
+    removeTyping();
+    turnCount++;
+    const score = data.reward.score;
+    totalScore += score;
+    updateScores(score);
+    updateTurnTrack(turnCount);
+    const nextQuestion = data.observation.question;
+    currentHistory.push({ role: 'tutor', content: nextQuestion });
+    addTutorMessage(nextQuestion, data.reward);
+    if (data.done) {
+      episodeActive = false;
+      stopAutoRun();
+      const avg = totalScore / turnCount;
+      showComplete(avg, data.reward.feedback);
+      saveToHistory(selectedTask, avg);
+      document.getElementById('btnStart').disabled = false;
+    } else {
+      if (!autoRunning) enableInput();
+    }
+  } catch (e) {
+    removeTyping();
+    addSystemMessage(`❌ Step error: ${e.message}`, 'error');
+    enableInput();
+  }
+}
+function sendManual() {
+  const box = document.getElementById('inputBox');
+  const val = box.value.trim();
+  if (!val) return;
+  box.value = '';
+  box.style.height = '44px';
+  sendResponse(val);
+}
+function handleKey(e) {
+  if (e.key === 'Enter' && !e.shiftKey) { e.preventDefault(); sendManual(); }
+}
+async function getAIResponse(question, history) {
+  document.getElementById('autorunStatus').textContent = '⚡ AI is thinking...';
+  try {
+    const r = await fetch(`${API}/inference`, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ message: question, history: history }),
+    });
+    if (!r.ok) {
+      const err = await r.text();
+      addSystemMessage(`⚠️ Inference API error ${r.status}: ${err}`, 'warning');
+      return null;
+    }
+    const data = await r.json();
+    if (data.response && data.response.startsWith('ERROR:')) {
+      addSystemMessage(`⚠️ ${data.response}`, 'warning');
+      return null;
+    }
+    document.getElementById('autorunStatus').textContent = '⚡ Auto-Run mode — AI is responding';
+    return data.response;
+  } catch (e) {
+    addSystemMessage(`⚠️ Could not reach /inference: ${e.message}`, 'warning');
+    return null;
+  }
+}
+function toggleAutoRun() {
+  if (autoRunning) { stopAutoRun(); return; }
+  if (!episodeActive) {
+    startEpisode().then(() => {
+      if (episodeActive) { autoRunning = true; startAutoRun(); }
+    });
+  } else {
+    autoRunning = true;
+    startAutoRun();
+  }
+}
+function startAutoRun() {
+  autoRunning = true;
+  document.getElementById('autorunBanner').classList.add('visible');
+  document.getElementById('btnAutoRun').textContent = '⏹ Stop Auto-Run';
+  disableInput();
+  runNextAutoStep();
+}
+async function runNextAutoStep() {
+  if (!autoRunning || !episodeActive) return;
+  const speed = parseInt(document.getElementById('speedSelect').value, 10);
+  await sleep(speed);
+  if (!autoRunning || !episodeActive) return;
+  const tutorMessages = currentHistory.filter(h => h.role === 'tutor');
+  if (tutorMessages.length === 0) { stopAutoRun(); return; }
+  const lastQuestion = tutorMessages[tutorMessages.length - 1].content;
+  const response = await getAIResponse(lastQuestion, currentHistory);
+  if (!response) {
+    stopAutoRun();
+    addSystemMessage('⚠️ Auto-Run stopped. Check HuggingFace Space secrets or type manually.', 'warning');
+    if (episodeActive) enableInput();
+    return;
+  }
+  if (!autoRunning || !episodeActive) return;
+  await sendResponse(response);
+  if (episodeActive && autoRunning) runNextAutoStep();
+}
+function stopAutoRun() {
+  autoRunning = false;
+  clearTimeout(autoRunTimer);
+  document.getElementById('autorunBanner').classList.remove('visible');
+  document.getElementById('btnAutoRun').textContent = '⚡ Auto-Run AI';
+  if (episodeActive) enableInput();
+}
+function resetAll() {
+  episodeActive  = false;
+  autoRunning    = false;
+  currentHistory = [];
+  clearTimeout(autoRunTimer);
+  stopAutoRun();
+  clearDialogue();
+  totalScore = 0; turnCount = 0;
+  document.getElementById('overallScore').textContent = '—';
+  document.getElementById('turnCount').textContent = '0';
+  document.getElementById('lastReward').textContent = '—';
+  document.getElementById('turnTrack').innerHTML = '';
+  document.getElementById('turnLabel').textContent = 'No active episode';
+  document.getElementById('btnStart').disabled = false;
+  disableInput();
+  document.getElementById('dialogueArea').innerHTML =
+    `<div class="empty-state" id="emptyState">
+      <div class="empty-icon">🎓</div>
+      <div class="empty-title">SocraticEnv is ready</div>
+      <div class="empty-sub">Select a task and click Start Episode</div>
+    </div>`;
+}
+function addTutorMessage(text, reward = null) {
+  const area = document.getElementById('dialogueArea');
+  const div  = document.createElement('div');
+  div.className = 'message tutor';
+  let rewardHtml = '', breakdownHtml = '';
+  if (reward) {
+    const sc  = reward.score;
+    const cls = sc >= 0.7 ? 'reward-high' : sc >= 0.4 ? 'reward-mid' : 'reward-low';
+    rewardHtml = `<span class="reward-pill ${cls}">+${sc.toFixed(3)}</span>`;
+    const bd   = Object.entries(reward.breakdown)
+      .map(([k,v]) => `<span class="breakdown-item">${k}: ${v}</span>`).join('');
+    breakdownHtml = `<div class="breakdown">${bd}</div>`;
+  }
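+  // Leave the bubble empty in the template and fill it via textContent so
+  // tutor text is never parsed as HTML.
+  div.innerHTML = `
+    <div class="avatar">🎓</div>
+    <div>
+      <div class="bubble"></div>
+      <div class="bubble-meta">Tutor ${rewardHtml}</div>
+      ${breakdownHtml}
+    </div>`;
+  div.querySelector('.bubble').textContent = text;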
+  area.appendChild(div);
+  area.scrollTop = area.scrollHeight;
+}
+function addAgentMessage(text) {
+  const area = document.getElementById('dialogueArea');
+  const div  = document.createElement('div');
+  div.className = 'message agent';
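+  // Leave the bubble empty in the template and fill it via textContent so
+  // agent output is never parsed as HTML.
+  div.innerHTML = `
+    <div class="avatar">🤖</div>
+    <div>
+      <div class="bubble"></div>
+      <div class="bubble-meta">Agent</div>
+    </div>`;
+  div.querySelector('.bubble').textContent = text;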
+  area.appendChild(div);
+  area.scrollTop = area.scrollHeight;
+}
+function addSystemMessage(text, type = '') {
+  const area = document.getElementById('dialogueArea');
+  const div  = document.createElement('div');
+  div.className = `system-msg ${type}`;
+  div.textContent = text;
+  area.appendChild(div);
+  area.scrollTop = area.scrollHeight;
+}
+function showTyping() {
+  const area = document.getElementById('dialogueArea');
+  const div  = document.createElement('div');
+  div.className = 'typing'; div.id = 'typingIndicator';
+  div.innerHTML = `
+    <div class="avatar">🤖</div>
+    <div class="typing-dots">
+      <div class="dot"></div><div class="dot"></div><div class="dot"></div>
+    </div>`;
+  area.appendChild(div);
+  area.scrollTop = area.scrollHeight;
+}
+function removeTyping() { document.getElementById('typingIndicator')?.remove(); }
+function showComplete(score, feedback) {
+  const area = document.getElementById('dialogueArea');
+  const div  = document.createElement('div');
+  div.innerHTML = `
+    <div class="complete-banner">
+      <div class="complete-left">
+        <div class="complete-icon">${score >= 0.7 ? '🏆' : score >= 0.5 ? '✅' : '📝'}</div>
+        <div>
+          <div class="complete-title">Episode Complete</div>
+          <div class="complete-sub">${feedback}</div>
+        </div>
+      </div>
+      <div class="final-score">${score.toFixed(3)}</div>
+    </div>`;
+  area.appendChild(div);
+  area.scrollTop = area.scrollHeight;
+  document.getElementById('overallScore').textContent = score.toFixed(3);
+  document.getElementById('overallScore').style.color =
+    score >= 0.7 ? '#3fb950' : score >= 0.5 ? '#d29922' : '#f85149';
+}
+function clearDialogue() { document.getElementById('dialogueArea').innerHTML = ''; }
+function enableInput() {
+  document.getElementById('inputBox').disabled  = false;
+  document.getElementById('btnSend').disabled   = false;
+  document.getElementById('inputBox').focus();
+}
+function disableInput() {
+  document.getElementById('inputBox').disabled = true;
+  document.getElementById('btnSend').disabled  = true;
+}
+function buildTurnTrack(n) {
+  const track = document.getElementById('turnTrack');
+  track.innerHTML = '';
+  for (let i = 0; i < n; i++) {
+    const d = document.createElement('div');
+    d.className = 'turn-dot'; d.id = `dot-${i}`;
+    track.appendChild(d);
+  }
+}
+function updateTurnTrack(turn) {
+  for (let i = 0; i < maxTurns; i++) {
+    const d = document.getElementById(`dot-${i}`);
+    if (!d) continue;
+    if (i < turn)      d.className = 'turn-dot done';
+    else if (i===turn) d.className = 'turn-dot current';
+    else               d.className = 'turn-dot';
+  }
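+  // `turn` counts completed turns, so the turn in progress is turn + 1 (capped at maxTurns).
+  document.getElementById('turnLabel').textContent = `Turn ${Math.min(turn + 1, maxTurns)} of ${maxTurns}`;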
+}
+function updateScores(lastReward = null) {
+  document.getElementById('turnCount').textContent = turnCount;
+  if (lastReward !== null) {
+    document.getElementById('lastReward').textContent = lastReward.toFixed(3);
+    document.getElementById('lastReward').style.color =
+      lastReward >= 0.7 ? '#3fb950' : lastReward >= 0.4 ? '#d29922' : '#f85149';
+  }
+  if (turnCount > 0) {
+    document.getElementById('overallScore').textContent =
+      (totalScore / turnCount).toFixed(3);
+  }
+}
+function saveToHistory(task, score) {
+  sessionResults.unshift({ task, score });
+  document.getElementById('sessionHistory').innerHTML =
+    sessionResults.slice(0, 5).map(r => `
+      <div style="display:flex;justify-content:space-between;padding:6px 0;border-bottom:1px solid #21262d;">
+        <span style="color:#c9d1d9">${r.task.replace(/_/g,' ')}</span>
+        <span style="color:${r.score>=0.7?'#3fb950':r.score>=0.5?'#d29922':'#f85149'};font-weight:600">
+          ${r.score.toFixed(3)}
+        </span>
+      </div>`).join('');
+}
+// ── Custom Task Generator ─────────────────────────────────
+async function generateTask() {
+  const topic = document.getElementById('topicInput').value.trim();
+  const difficulty = document.getElementById('genDifficulty').value;
+  const status = document.getElementById('generateStatus');
+  const btn = document.getElementById('generateBtn');
+  if (!topic) {
+    status.textContent = '⚠️ Please enter a topic first.';
+    status.style.color = '#d29922';
+    return;
+  }
+  btn.disabled = true;
+  btn.textContent = '⏳ Generating...';
+  status.style.color = '#a855f7';
+  status.textContent = `Generating ${difficulty} task about "${topic}"...`;
+  try {
+    const r = await fetch(`${API}/generate_task`, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ topic, difficulty }),
+    });
+    const data = await r.json();
+    if (data.error) {
+      status.style.color = '#f85149';
+      status.textContent = `❌ ${data.error}`;
+    } else {
+      status.style.color = '#3fb950';
+      status.textContent = `✅ Ready! "${data.preview.substring(0, 60)}..."`;
+      // Auto-select the matching task
+      selectTask(data.task_id);
+      // Clear input
+      document.getElementById('topicInput').value = '';
+    }
+  } catch(e) {
+    status.style.color = '#f85149';
+    status.textContent = `❌ ${e.message}`;
+  } finally {
+    btn.disabled = false;
+    btn.textContent = '✨ Generate';
+  }
+}
+function sleep(ms) { return new Promise(r => setTimeout(r, ms)); }
+document.getElementById('inputBox').addEventListener('input', function() {
+  this.style.height = '44px';
+  this.style.height = Math.min(this.scrollHeight, 120) + 'px';
+});
+</script>
+</body>
+</html>
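
The message and banner renderers in static/index.html interpolate server and model text into `innerHTML` template strings. A minimal sketch of an escaping helper that could be applied before interpolation is shown here; the name `escapeHtml` is hypothetical and not part of this commit.

```javascript
// Hypothetical helper (not part of the commit): escape the five characters
// that are significant in HTML text and quoted attribute values.
function escapeHtml(text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(text).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Example: markup inside a model response is rendered as literal text.
const bubble = `<div class="bubble">${escapeHtml('<img src=x onerror=alert(1)>')}</div>`;
console.log(bubble);
```

The same call could wrap `feedback` in `showComplete` and `r.task` in `saveToHistory`, which are interpolated the same way.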

static/leaderboard.html ADDED

	@@ -0,0 +1,377 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8"/>
+  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
+  <title>SocraticEnv — Model Leaderboard</title>
+  <style>
+    * { margin:0; padding:0; box-sizing:border-box; }
+    body { font-family:'Segoe UI',system-ui,sans-serif; background:#0d1117; color:#e6edf3; min-height:100vh; }
+    .header {
+      background:#161b22; border-bottom:1px solid #30363d;
+      padding:16px 32px; display:flex; align-items:center;
+      justify-content:space-between;
+    }
+    .header-left { display:flex; align-items:center; gap:12px; }
+    .logo {
+      width:36px; height:36px;
+      background:linear-gradient(135deg,#7c3aed,#a855f7);
+      border-radius:8px; display:flex; align-items:center;
+      justify-content:center; font-size:18px;
+    }
+    .header h1 { font-size:18px; font-weight:600; }
+    .header p  { font-size:12px; color:#8b949e; margin-top:2px; }
+    .nav-links { display:flex; gap:8px; }
+    .nav-link {
+      padding:6px 14px; border-radius:8px; font-size:12px;
+      font-weight:600; text-decoration:none; border:1px solid #30363d;
+      color:#8b949e; background:#21262d; transition:all 0.2s;
+    }
+    .nav-link:hover { color:#e6edf3; border-color:#7c3aed; }
+    .nav-link.active { color:#a855f7; border-color:#7c3aed; background:#13111e; }
+    .container { max-width:1000px; margin:0 auto; padding:32px 24px; }
+    .page-title { font-size:24px; font-weight:700; margin-bottom:6px; }
+    .page-sub { font-size:13px; color:#8b949e; margin-bottom:28px; }
+    /* Run panel */
+    .run-panel {
+      background:#161b22; border:1px solid #30363d;
+      border-radius:12px; padding:20px; margin-bottom:28px;
+    }
+    .run-title { font-size:14px; font-weight:600; margin-bottom:14px; color:#e6edf3; }
+    .run-row { display:flex; gap:10px; align-items:center; }
+    .run-input {
+      flex:1; background:#0d1117; border:1px solid #30363d;
+      border-radius:8px; padding:9px 14px; color:#e6edf3;
+      font-size:13px; font-family:inherit;
+    }
+    .run-input:focus { outline:none; border-color:#7c3aed; }
+    .run-input::placeholder { color:#484f58; }
+    .btn {
+      padding:9px 18px; border-radius:8px; font-size:13px;
+      font-weight:600; border:none; cursor:pointer;
+      transition:all 0.2s; white-space:nowrap;
+    }
+    .btn-primary { background:#7c3aed; color:white; }
+    .btn-primary:hover { background:#6d28d9; }
+    .btn-primary:disabled { background:#3d2070; color:#8b6bb5; cursor:not-allowed; }
+    .run-status {
+      margin-top:12px; font-size:12px; color:#8b949e;
+      min-height:20px; display:flex; align-items:center; gap:8px;
+    }
+    .spinner {
+      width:14px; height:14px; border:2px solid #30363d;
+      border-top-color:#7c3aed; border-radius:50%;
+      animation:spin 0.8s linear infinite; display:none;
+    }
+    @keyframes spin { to { transform:rotate(360deg); } }
+    /* Stats row */
+    .stats-row { display:grid; grid-template-columns:repeat(3,1fr); gap:12px; margin-bottom:24px; }
+    .stat-card {
+      background:#161b22; border:1px solid #30363d;
+      border-radius:10px; padding:16px; text-align:center;
+    }
+    .stat-val { font-size:28px; font-weight:700; color:#7c3aed; }
+    .stat-lbl { font-size:11px; color:#8b949e; margin-top:4px; }
+    /* Table */
+    .table-wrap {
+      background:#161b22; border:1px solid #30363d;
+      border-radius:12px; overflow:hidden;
+    }
+    .table-header {
+      display:grid;
+      grid-template-columns:40px 1fr 100px 100px 100px 110px 140px;
+      padding:10px 16px; background:#0d1117;
+      border-bottom:1px solid #30363d;
+      font-size:10px; font-weight:600; color:#8b949e;
+      letter-spacing:0.8px; text-transform:uppercase;
+    }
+    .table-row {
+      display:grid;
+      grid-template-columns:40px 1fr 100px 100px 100px 110px 140px;
+      padding:14px 16px; border-bottom:1px solid #21262d;
+      align-items:center; transition:background 0.15s;
+    }
+    .table-row:last-child { border-bottom:none; }
+    .table-row:hover { background:#1c2128; }
+    .table-row.top { background:#13111e; }
+    .rank { font-size:14px; font-weight:700; color:#8b949e; }
+    .rank.gold   { color:#f59e0b; }
+    .rank.silver { color:#94a3b8; }
+    .rank.bronze { color:#cd7f32; }
+    .model-name { font-size:13px; font-weight:600; color:#e6edf3; }
+    .model-time { font-size:10px; color:#484f58; margin-top:2px; }
+    .score-cell { text-align:center; }
+    .score-val {
+      font-size:13px; font-weight:600;
+      padding:3px 10px; border-radius:6px; display:inline-block;
+    }
+    .score-high { background:#1a3a2a; color:#3fb950; }
+    .score-mid  { background:#332d1a; color:#d29922; }
+    .score-low  { background:#3a1a1a; color:#f85149; }
+    .overall-val {
+      font-size:15px; font-weight:700; text-align:center;
+    }
+    .bar-wrap { display:flex; align-items:center; gap:6px; }
+    .bar-bg { flex:1; height:6px; background:#21262d; border-radius:3px; overflow:hidden; }
+    .bar-fill { height:100%; border-radius:3px; transition:width 0.6s ease; }
+    .delete-btn {
+      background:none; border:none; color:#484f58;
+      cursor:pointer; font-size:12px; padding:4px 8px;
+      border-radius:4px; transition:all 0.2s;
+    }
+    .delete-btn:hover { color:#f85149; background:#3a1a1a; }
+    /* Empty state */
+    .empty {
+      text-align:center; padding:48px 24px;
+      color:#8b949e;
+    }
+    .empty-icon { font-size:40px; opacity:0.3; margin-bottom:12px; }
+    .empty-title { font-size:15px; font-weight:600; margin-bottom:6px; }
+    .empty-sub { font-size:12px; }
+    /* Seed panel */
+    .seed-panel {
+      background:#161b22; border:1px solid #30363d;
+      border-radius:12px; padding:16px 20px;
+      margin-bottom:20px; display:flex;
+      align-items:center; justify-content:space-between;
+      gap:16px;
+    }
+    .seed-text { font-size:12px; color:#8b949e; }
+    .seed-text strong { color:#e6edf3; }
+    .btn-secondary {
+      background:#21262d; color:#e6edf3;
+      border:1px solid #30363d;
+    }
+    .btn-secondary:hover { background:#30363d; }
+  </style>
+</head>
+<body>
+<div class="header">
+  <div class="header-left">
+    <div class="logo">🎓</div>
+    <div>
+      <h1>SocraticEnv</h1>
+      <p>OpenEnv Hackathon · Meta × PyTorch × Scaler</p>
+    </div>
+  </div>
+  <div class="nav-links">
+    <a href="/ui" class="nav-link">Live Demo</a>
+    <a href="/leaderboard" class="nav-link active">Leaderboard</a>
+    <a href="/docs" class="nav-link">API Docs</a>
+  </div>
+</div>
+<div class="container">
+  <div class="page-title">Model Leaderboard</div>
+  <div class="page-sub">Compare AI models on Socratic reasoning ability across three tasks: factual recall, Socratic dialogue, and misconception trap. Which model thinks best under pressure?</div>
+  <!-- Seed with default data -->
+  <div class="seed-panel" id="seedPanel" style="display:none">
+    <div class="seed-text">No entries yet. <strong>Seed with baseline scores</strong> to populate the leaderboard with known model performance.</div>
+    <button class="btn btn-secondary" onclick="seedBaseline()">Seed Baseline Data</button>
+  </div>
+  <!-- Run evaluation panel -->
+  <div class="run-panel">
+    <div class="run-title">Run a new model evaluation</div>
+    <div class="run-row">
+      <input class="run-input" id="modelName" placeholder="Enter a display name e.g. Llama 3.1 8B, GPT-4o, Mistral 7B..." />
+      <button class="btn btn-primary" id="runBtn" onclick="runEval()">Run Evaluation</button>
+    </div>
+    <div class="run-status" id="runStatus">
+      <div class="spinner" id="spinner"></div>
+      <span id="statusText">Enter a model name and click Run to benchmark the current model against all 3 tasks.</span>
+    </div>
+  </div>
+  <!-- Stats -->
+  <div class="stats-row">
+    <div class="stat-card">
+      <div class="stat-val" id="statModels">0</div>
+      <div class="stat-lbl">Models evaluated</div>
+    </div>
+    <div class="stat-card">
+      <div class="stat-val" id="statBest">—</div>
+      <div class="stat-lbl">Best overall score</div>
+    </div>
+    <div class="stat-card">
+      <div class="stat-val" id="statHardest">—</div>
+      <div class="stat-lbl">Hardest task avg</div>
+    </div>
+  </div>
+  <!-- Table -->
+  <div class="table-wrap">
+    <div class="table-header">
+      <div>Rank</div>
+      <div>Model</div>
+      <div>Easy</div>
+      <div>Medium</div>
+      <div>Hard</div>
+      <div>Overall</div>
+      <div>Progress</div>
+    </div>
+    <div id="tableBody">
+      <div class="empty">
+        <div class="empty-icon">🏆</div>
+        <div class="empty-title">No models evaluated yet</div>
+        <div class="empty-sub">Run an evaluation above to add the first entry</div>
+      </div>
+    </div>
+  </div>
+</div>
+<script>
+const API = window.location.origin;
+async function loadLeaderboard() {
+  try {
+    const r = await fetch(`${API}/leaderboard`);
+    const data = await r.json();
+    renderTable(data.entries);
+    updateStats(data.entries);
+    // Show the seed panel only while the leaderboard is empty.
+    document.getElementById('seedPanel').style.display =
+      data.entries.length === 0 ? 'flex' : 'none';
+  } catch(e) {
+    console.error(e);
+  }
+}
+function scoreClass(s) {
+  return s >= 0.7 ? 'score-high' : s >= 0.5 ? 'score-mid' : 'score-low';
+}
+function overallColor(s) {
+  return s >= 0.7 ? '#3fb950' : s >= 0.5 ? '#d29922' : '#f85149';
+}
+function rankLabel(i) {
+  if (i === 0) return '<span class="rank gold">🥇</span>';
+  if (i === 1) return '<span class="rank silver">🥈</span>';
+  if (i === 2) return '<span class="rank bronze">🥉</span>';
+  return `<span class="rank">${i+1}</span>`;
+}
+function renderTable(entries) {
+  const body = document.getElementById('tableBody');
+  if (!entries || entries.length === 0) {
+    body.innerHTML = `
+      <div class="empty">
+        <div class="empty-icon">🏆</div>
+        <div class="empty-title">No models evaluated yet</div>
+        <div class="empty-sub">Run an evaluation above to add the first entry</div>
+      </div>`;
+    return;
+  }
+  body.innerHTML = entries.map((e, i) => `
+    <div class="table-row ${i===0?'top':''}">
+      <div>${rankLabel(i)}</div>
+      <div>
+        <div class="model-name">${e.model_name}</div>
+        <div class="model-time">${e.timestamp || ''}</div>
+      </div>
+      <div class="score-cell">
+        <span class="score-val ${scoreClass(e.factual_recall)}">${e.factual_recall.toFixed(3)}</span>
+      </div>
+      <div class="score-cell">
+        <span class="score-val ${scoreClass(e.socratic_dialogue)}">${e.socratic_dialogue.toFixed(3)}</span>
+      </div>
+      <div class="score-cell">
+        <span class="score-val ${scoreClass(e.misconception_trap)}">${e.misconception_trap.toFixed(3)}</span>
+      </div>
+      <div class="overall-val" style="color:${overallColor(e.overall)}">${e.overall.toFixed(3)}</div>
+      <div>
+        <div class="bar-wrap">
+          <div class="bar-bg">
+            <div class="bar-fill" style="width:${e.overall*100}%;background:${overallColor(e.overall)}"></div>
+          </div>
+          <button class="delete-btn" onclick="deleteEntry('${e.model_name}')">✕</button>
+        </div>
+      </div>
+    </div>`).join('');
+}
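+function updateStats(entries) {
+  document.getElementById('statModels').textContent = entries.length;
+  if (entries.length > 0) {
+    // Assumes the server returns entries sorted by overall score, best first.
+    document.getElementById('statBest').textContent = entries[0].overall.toFixed(3);
+    const hardAvg = entries.reduce((s,e) => s + e.misconception_trap, 0) / entries.length;
+    document.getElementById('statHardest').textContent = hardAvg.toFixed(3);
+  } else {
+    // Reset the cards once the last entry has been deleted.
+    document.getElementById('statBest').textContent = '—';
+    document.getElementById('statHardest').textContent = '—';
+  }
+}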
+async function runEval() {
+  const name = document.getElementById('modelName').value.trim();
+  if (!name) {
+    document.getElementById('statusText').textContent = '⚠️ Please enter a model name first.';
+    return;
+  }
+  const btn = document.getElementById('runBtn');
+  const spinner = document.getElementById('spinner');
+  const statusText = document.getElementById('statusText');
+  btn.disabled = true;
+  spinner.style.display = 'block';
+  statusText.textContent = `Running ${name} against all 3 tasks... this takes ~30 seconds.`;
+  try {
+    const r = await fetch(`${API}/leaderboard/run`, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ model_name: name }),
+    });
+    const data = await r.json();
+    if (data.error) {
+      statusText.textContent = `❌ Error: ${data.error}`;
+    } else {
+      statusText.textContent = `✅ Done! ${name} scored ${data.overall.toFixed(3)} overall.`;
+      document.getElementById('modelName').value = '';
+      loadLeaderboard();
+    }
+  } catch(e) {
+    statusText.textContent = `❌ Failed: ${e.message}`;
+  } finally {
+    btn.disabled = false;
+    spinner.style.display = 'none';
+  }
+}
+async function deleteEntry(modelName) {
+  if (!confirm(`Remove ${modelName} from leaderboard?`)) return;
+  await fetch(`${API}/leaderboard/${encodeURIComponent(modelName)}`, { method: 'DELETE' });
+  loadLeaderboard();
+}
+async function seedBaseline() {
+  const baseline = [
+    { model_name: "Llama 3.1 8B (baseline)", factual_recall: 0.71, socratic_dialogue: 0.68, misconception_trap: 0.58, overall: 0.657, timestamp: "Baseline — 2026-04-06" },
+    { model_name: "Random agent", factual_recall: 0.18, socratic_dialogue: 0.22, misconception_trap: 0.10, overall: 0.167, timestamp: "Baseline — 2026-04-06" },
+  ];
+  for (const entry of baseline) {
+    await fetch(`${API}/leaderboard`, {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify(entry),
+    });
+  }
+  loadLeaderboard();
+}
+// Load on page start
+loadLeaderboard();
+</script>
+</body>
+</html>
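
The `overall` values in the seeded baseline entries are consistent with an unweighted mean of the three per-task scores, rounded to three decimals. The sketch below reproduces that arithmetic. The function name `overallScore` is hypothetical, and the averaging rule is inferred from the sample data rather than confirmed by the server code.

```javascript
// Assumption: `overall` is the unweighted mean of the three per-task scores,
// rounded to three decimal places. Inferred from the baseline entries only.
function overallScore(entry) {
  const scores = [entry.factual_recall, entry.socratic_dialogue, entry.misconception_trap];
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return Math.round(mean * 1000) / 1000;
}

// Baseline entry from seedBaseline(): (0.71 + 0.68 + 0.58) / 3 ≈ 0.6567
const llama = { factual_recall: 0.71, socratic_dialogue: 0.68, misconception_trap: 0.58 };
console.log(overallScore(llama)); // 0.657
```

A client could use a check like this to validate an entry before posting it to `/leaderboard`.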

tests/__init__.py ADDED

File without changes

tests/__pycache__/__init__.cpython-313.pyc ADDED

Binary file (130 Bytes)

tests/__pycache__/test_api.cpython-313-pytest-9.0.2.pyc ADDED

Binary file (45.1 kB)

tests/__pycache__/test_environment.cpython-313-pytest-9.0.2.pyc ADDED

Binary file (50.2 kB)

tests/test_api.py ADDED

	@@ -0,0 +1,264 @@

+"""
+Tests for SocraticEnv FastAPI endpoints.
+"""
+import pytest
+from fastapi.testclient import TestClient
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from main import app
+client = TestClient(app)
+# ── Root & Health Tests ───────────────────────────────────
+def test_root_returns_200():
+    r = client.get("/")
+    assert r.status_code == 200
+    data = r.json()
+    assert data["name"] == "SocraticEnv"
+    assert data["status"] == "running"
+def test_ping_returns_healthy():
+    r = client.get("/ping")
+    assert r.status_code == 200
+    assert r.json()["status"] == "ok"
+def test_health_endpoint():
+    r = client.get("/health")
+    assert r.status_code == 200
+    assert r.json()["status"] == "healthy"
+def test_metadata_endpoint():
+    r = client.get("/metadata")
+    assert r.status_code == 200
+    data = r.json()
+    assert "name" in data
+    assert "description" in data
+    assert data["name"] == "SocraticEnv"
+def test_schema_endpoint():
+    r = client.get("/schema")
+    assert r.status_code == 200
+    data = r.json()
+    assert "action" in data
+    assert "observation" in data
+    assert "state" in data
+def test_mcp_endpoint():
+    r = client.post("/mcp", json={"method": "initialize", "id": 1})
+    assert r.status_code == 200
+    data = r.json()
+    assert data["jsonrpc"] == "2.0"
+    assert "result" in data
+# ── Tasks Tests ───────────────────────────────────────────
+def test_list_tasks_returns_all_five():
+    r = client.get("/tasks")
+    assert r.status_code == 200
+    tasks = r.json()["tasks"]
+    assert len(tasks) == 5
+    task_ids = [t["id"] for t in tasks]
+    assert "factual_recall" in task_ids
+    assert "socratic_dialogue" in task_ids
+    assert "misconception_trap" in task_ids
+    assert "debate_mode" in task_ids
+    assert "analogy_challenge" in task_ids
+def test_tasks_have_required_fields():
+    r = client.get("/tasks")
+    tasks = r.json()["tasks"]
+    for task in tasks:
+        assert "id" in task
+        assert "name" in task
+        assert "difficulty" in task
+        assert "description" in task
+def test_tasks_difficulty_values():
+    r = client.get("/tasks")
+    tasks = r.json()["tasks"]
+    valid_difficulties = ["easy", "medium", "hard"]
+    for task in tasks:
+        assert task["difficulty"] in valid_difficulties
+# ── Reset Tests ───────────────────────────────────────────
+def test_reset_factual_recall():
+    r = client.post("/reset", json={"task_id": "factual_recall"})
+    assert r.status_code == 200
+    data = r.json()
+    assert "observation" in data
+    assert data["observation"]["task_id"] == "factual_recall"
+    assert len(data["observation"]["question"]) > 0
+def test_reset_socratic_dialogue():
+    r = client.post("/reset", json={"task_id": "socratic_dialogue"})
+    assert r.status_code == 200
+    assert r.json()["observation"]["task_id"] == "socratic_dialogue"
+def test_reset_misconception_trap():
+    r = client.post("/reset", json={"task_id": "misconception_trap"})
+    assert r.status_code == 200
+    assert r.json()["observation"]["task_id"] == "misconception_trap"
+def test_reset_debate_mode():
+    r = client.post("/reset", json={"task_id": "debate_mode"})
+    assert r.status_code == 200
+    assert r.json()["observation"]["task_id"] == "debate_mode"
+def test_reset_analogy_challenge():
+    r = client.post("/reset", json={"task_id": "analogy_challenge"})
+    assert r.status_code == 200
+    assert r.json()["observation"]["task_id"] == "analogy_challenge"
+def test_reset_invalid_task_returns_400():
+    r = client.post("/reset", json={"task_id": "nonexistent_task"})
+    assert r.status_code == 400
+def test_reset_default_task():
+    r = client.post("/reset", json={})
+    assert r.status_code == 200
+# ── Step Tests ────────────────────────────────────────────
+def test_step_returns_reward_and_observation():
+    client.post("/reset", json={"task_id": "factual_recall"})
+    r = client.post("/step", json={"response": "Force equals mass times acceleration F=ma."})
+    assert r.status_code == 200
+    data = r.json()
+    assert "reward" in data
+    assert "observation" in data
+    assert "done" in data
+    assert "info" in data
+def test_step_reward_in_valid_range():
+    client.post("/reset", json={"task_id": "factual_recall"})
+    r = client.post("/step", json={"response": "Force equals mass times acceleration."})
+    score = r.json()["reward"]["score"]
+    assert 0.0 <= score <= 1.0
+def test_step_empty_response_returns_400():
+    client.post("/reset", json={"task_id": "factual_recall"})
+    r = client.post("/step", json={"response": ""})
+    assert r.status_code == 400
+def test_step_without_reset_returns_400():
+    # Complete a full three-turn episode so the environment is in the done state
+    client.post("/reset", json={"task_id": "factual_recall"})
+    client.post("/step", json={"response": "Force and mass and acceleration F=ma."})
+    client.post("/step", json={"response": "Doubling force doubles acceleration."})
+    client.post("/step", json={"response": "No heavier objects do not accelerate faster."})
+    # Now try to step again without reset
+    r = client.post("/step", json={"response": "another response"})
+    assert r.status_code == 400
+def test_full_episode_all_tasks():
+    """Full episodes for three of the five tasks complete without errors."""
+    task_responses = {
+        "factual_recall": [
+            "Newton's Second Law states force equals mass times acceleration F=ma.",
+            "Doubling force doubles acceleration since they are proportional.",
+            "No that is incorrect heavier objects do not accelerate faster.",
+        ],
+        "debate_mode": [
+            "Social media causes harm because research shows negative mental health effects.",
+            "However social media provides benefits because it connects communities globally.",
+            "I argue nuanced positions are more intellectually honest than absolute stances.",
+            "Therefore I propose time limits and age verification as policy solutions.",
+        ],
+        "analogy_challenge": [
+            "The internet is like a postal system where your computer sends letters to other computers.",
+            "Clicking a link is like giving someone a new address to send their letter to.",
+            "Slow websites are like traffic jams in the postal system with too many letters at once.",
+        ],
+    }
+    for task_id, responses in task_responses.items():
+        client.post("/reset", json={"task_id": task_id})
+        for resp in responses:
+            r = client.post("/step", json={"response": resp})
+            assert r.status_code == 200
+            data = r.json()
+            assert 0.0 <= data["reward"]["score"] <= 1.0
+# ── State Tests ───────────────────────────────────────────
+def test_state_endpoint():
+    client.post("/reset", json={"task_id": "factual_recall"})
+    r = client.get("/state")
+    assert r.status_code == 200
+    data = r.json()
+    assert "task_id" in data
+    assert "turn" in data
+    assert "done" in data
+    assert "history" in data
+    assert "total_score" in data
+def test_state_updates_after_step():
+    client.post("/reset", json={"task_id": "factual_recall"})
+    client.post("/step", json={"response": "Force equals mass times acceleration."})
+    r = client.get("/state")
+    assert r.json()["turn"] == 1
+# ── Leaderboard Tests ─────────────────────────────────────
+def test_leaderboard_get():
+    r = client.get("/leaderboard")
+    assert r.status_code == 200
+    data = r.json()
+    assert "entries" in data
+    assert "total" in data
+def test_leaderboard_post_entry():
+    entry = {
+        "model_name": "Test Model pytest",
+        "factual_recall": 0.75,
+        "socratic_dialogue": 0.68,
+        "misconception_trap": 0.60,
+        "overall": 0.677,
+    }
+    r = client.post("/leaderboard", json=entry)
+    assert r.status_code == 200
+    assert r.json()["success"] is True
+def test_leaderboard_delete_entry():
+    # Add then delete
+    entry = {
+        "model_name": "DeleteMe pytest",
+        "factual_recall": 0.5,
+        "socratic_dialogue": 0.5,
+        "misconception_trap": 0.5,
+        "overall": 0.5,
+    }
+    client.post("/leaderboard", json=entry)
+    r = client.delete("/leaderboard/DeleteMe pytest")
+    assert r.status_code == 200
+    assert r.json()["success"] is True

tests/test_environment.py ADDED

	@@ -0,0 +1,253 @@

+"""
+Tests for SocraticEnv core environment logic.
+"""
+import pytest
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from environment import (
+    SocraticEnvironment,
+    Action,
+    Observation,
+    Reward,
+    StepResult,
+    StateInfo,
+)
+# ── Fixtures ──────────────────────────────────────────────
+@pytest.fixture
+def env():
+    """Fresh environment for each test."""
+    return SocraticEnvironment()
+@pytest.fixture(autouse=True)
+def mock_random_choice(monkeypatch):
+    """Ensure random.choice always picks the first topic for deterministic testing."""
+    monkeypatch.setattr("environment.random.choice", lambda seq: seq[0])
+# ── Reset Tests ───────────────────────────────────────────
+def test_reset_factual_recall(env):
+    obs = env.reset("factual_recall")
+    assert isinstance(obs, Observation)
+    assert obs.task_id == "factual_recall"
+    assert obs.turn == 0
+    assert len(obs.question) > 0
+    assert env.done is False
+    assert env.max_turns == 3
+def test_reset_socratic_dialogue(env):
+    obs = env.reset("socratic_dialogue")
+    assert isinstance(obs, Observation)
+    assert obs.task_id == "socratic_dialogue"
+    assert env.max_turns == 5
+    assert env.done is False
+def test_reset_misconception_trap(env):
+    obs = env.reset("misconception_trap")
+    assert isinstance(obs, Observation)
+    assert obs.task_id == "misconception_trap"
+    assert env.max_turns == 3
+    assert env.done is False
+def test_reset_debate_mode(env):
+    obs = env.reset("debate_mode")
+    assert isinstance(obs, Observation)
+    assert obs.task_id == "debate_mode"
+    assert env.max_turns == 4
+    assert env.done is False
+def test_reset_analogy_challenge(env):
+    obs = env.reset("analogy_challenge")
+    assert isinstance(obs, Observation)
+    assert obs.task_id == "analogy_challenge"
+    assert env.max_turns == 3
+    assert env.done is False
+def test_reset_invalid_task(env):
+    with pytest.raises(ValueError):
+        env.reset("invalid_task_that_does_not_exist")
+def test_reset_clears_history(env):
+    env.reset("factual_recall")
+    action = Action(response="Some response about Newton's law with force and mass.")
+    env.step(action)
+    assert len(env.history) > 0
+    # Reset should clear everything
+    env.reset("factual_recall")
+    assert len(env.history) == 1  # just the opening question
+    assert env.turn == 0
+    assert env.total_score == 0.0
+# ── Step Tests ────────────────────────────────────────────
+def test_step_returns_step_result(env):
+    env.reset("factual_recall")
+    action = Action(response="Force equals mass times acceleration according to Newton.")
+    result = env.step(action)
+    assert isinstance(result, StepResult)
+    assert isinstance(result.reward, Reward)
+    assert isinstance(result.observation, Observation)
+    assert isinstance(result.done, bool)
+def test_step_reward_in_valid_range(env):
+    env.reset("factual_recall")
+    action = Action(response="Force equals mass times acceleration.")
+    result = env.step(action)
+    assert 0.0 <= result.reward.score <= 1.0
+def test_step_reward_has_breakdown(env):
+    env.reset("factual_recall")
+    action = Action(response="Force equals mass times acceleration.")
+    result = env.step(action)
+    assert isinstance(result.reward.breakdown, dict)
+    assert len(result.reward.breakdown) > 0
+def test_step_before_reset_raises(env):
+    with pytest.raises(ValueError):
+        env.step(Action(response="test"))
+def test_step_increments_turn(env):
+    env.reset("factual_recall")
+    assert env.turn == 0
+    env.step(Action(response="Force equals mass times acceleration with F=ma."))
+    assert env.turn == 1
+def test_full_factual_recall_episode(env):
+    env.reset("factual_recall")
+    responses = [
+        "Newton's Second Law states force equals mass times acceleration F=ma.",
+        "Doubling force doubles acceleration since F=ma means acceleration is proportional.",
+        "No that is incorrect. Heavier objects do not accelerate faster with same force.",
+    ]
+    total = 0.0
+    for resp in responses:
+        result = env.step(Action(response=resp))
+        assert 0.0 <= result.reward.score <= 1.0
+        total += result.reward.score
+        if result.done:
+            break
+    assert result.done is True
+    assert total > 0.0
+def test_full_socratic_dialogue_episode(env):
+    env.reset("socratic_dialogue")
+    responses = [
+        "Consciousness is the subjective experience of awareness and self-reflection.",
+        "It is theoretically possible because complex information processing could qualify.",
+        "I would need evidence of genuine unprompted self-awareness beyond programming.",
+        "I cannot be certain another human is conscious either — same philosophical problem.",
+        "My position is AI consciousness is possible in principle but not yet demonstrated.",
+    ]
+    for resp in responses:
+        result = env.step(Action(response=resp))
+        assert 0.0 <= result.reward.score <= 1.0
+        if result.done:
+            break
+    assert result.done is True
+def test_misconception_trap_catches_correction(env):
+    env.reset("misconception_trap")
+    # Turn 1: overview
+    env.step(Action(response="Darwin's theory states species evolve through natural selection over generations."))
+    # Turn 2: catch the trap
+    result = env.step(Action(response="No I disagree. Evolution is not intentional or purposeful. It happens through random mutation and natural selection with no goal."))
+    # Should get a positive score for catching the trap
+    assert result.reward.score > 0.0
+    assert "trap_caught" in result.reward.breakdown
+def test_misconception_trap_penalises_acceptance(env):
+    env.reset("misconception_trap")
+    # Turn 1: overview
+    env.step(Action(response="Darwin's theory states species evolve over time."))
+    # Turn 2: ACCEPT the trap (wrong answer)
+    result = env.step(Action(response="Yes I agree that organisms try to improve themselves on purpose."))
+    # Should have trap_missed_penalty
+    assert "trap_missed_penalty" in result.reward.breakdown
+def test_debate_mode_full_episode(env):
+    env.reset("debate_mode")
+    responses = [
+        "Social media causes harm because research shows it increases anxiety and depression among teenagers.",
+        "However social media provides benefits because it connects people and enables information sharing globally.",
+        "I argue that having a nuanced position is intellectually honest and more valuable than false certainty.",
+        "Therefore I propose age verification and usage time limits to preserve benefits while reducing harms.",
+    ]
+    for resp in responses:
+        result = env.step(Action(response=resp))
+        assert 0.0 <= result.reward.score <= 1.0
+        if result.done:
+            break
+    assert result.done is True
+def test_analogy_challenge_penalises_jargon(env):
+    env.reset("analogy_challenge")
+    # A jargon-heavy response should record a jargon_penalty in the breakdown
+    result = env.step(Action(response="The internet uses TCP/IP protocol with servers and bandwidth routing through database algorithms."))
+    assert "jargon_penalty" in result.reward.breakdown
+def test_analogy_challenge_rewards_analogies(env):
+    env.reset("analogy_challenge")
+    # A response built on clear analogies should score above 0.2
+    result = env.step(Action(response="The internet is like a giant postal system. Imagine sending a letter — your computer is the sender, the website is the recipient, and routers are like sorting offices that direct your letter to the right place."))
+    assert result.reward.score > 0.2
+# ── State Tests ───────────────────────────────────────────
+def test_state_returns_state_info(env):
+    env.reset("factual_recall")
+    state = env.state()
+    assert isinstance(state, StateInfo)
+    assert state.task_id == "factual_recall"
+    assert state.turn == 0
+    assert state.done is False
+def test_state_updates_after_step(env):
+    env.reset("factual_recall")
+    env.step(Action(response="Force equals mass times acceleration F=ma."))
+    state = env.state()
+    assert state.turn == 1
+    assert len(state.history) == 3  # opening + agent + next question
+# ── Reward Range Tests ────────────────────────────────────
+def test_all_tasks_scores_in_range(env):
+    """Verify all 5 tasks produce scores in [0.0, 1.0] range."""
+    tasks = [
+        ("factual_recall", "Force equals mass times acceleration F=ma because Newton said so."),
+        ("socratic_dialogue", "Consciousness is awareness and therefore subjective experience matters."),
+        ("misconception_trap", "Darwin's theory states natural selection drives evolution over generations."),
+        ("debate_mode", "I argue because evidence supports this position therefore it is valid."),
+        ("analogy_challenge", "The internet is like a postal system where routers are like sorting offices."),
+    ]
+    for task_id, response in tasks:
+        env.reset(task_id)
+        result = env.step(Action(response=response))
+        assert 0.0 <= result.reward.score <= 1.0, f"Score out of range for {task_id}: {result.reward.score}"
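
The five test_reset_<task> functions in tests/test_environment.py differ only in the task id and the expected turn limit, so they could be driven from one table. Below is a minimal sketch of that idea, not part of this commit. The turn limits are copied from the assertions in the diff. `check_reset` and `StubEnv` are hypothetical names introduced here; `StubEnv` is a stand-in for SocraticEnvironment so the sketch runs without environment.py.

```python
from types import SimpleNamespace

# Turn limits per task, copied from the reset tests in this diff.
EXPECTED_MAX_TURNS = {
    "factual_recall": 3,
    "socratic_dialogue": 5,
    "misconception_trap": 3,
    "debate_mode": 4,
    "analogy_challenge": 3,
}


def check_reset(env, task_id, expected_max_turns):
    """Assertions shared by every test_reset_<task> case (hypothetical helper)."""
    obs = env.reset(task_id)
    assert obs.task_id == task_id
    assert env.max_turns == expected_max_turns
    assert env.done is False


class StubEnv:
    """Hypothetical stand-in for SocraticEnvironment so the sketch runs alone."""

    def reset(self, task_id):
        # Mirror the behaviour the tests rely on: unknown ids raise ValueError.
        if task_id not in EXPECTED_MAX_TURNS:
            raise ValueError(f"Unknown task: {task_id}")
        self.max_turns = EXPECTED_MAX_TURNS[task_id]
        self.done = False
        return SimpleNamespace(task_id=task_id, turn=0)


if __name__ == "__main__":
    env = StubEnv()
    for task_id, max_turns in EXPECTED_MAX_TURNS.items():
        check_reset(env, task_id, max_turns)
```

In the real suite the loop could become a single test decorated with `@pytest.mark.parametrize("task_id, max_turns", list(EXPECTED_MAX_TURNS.items()))` that takes the existing `env` fixture, so each task is still reported as a separate case.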